[
https://issues.apache.org/jira/browse/HDFS-12090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16339082#comment-16339082
]
Ewan Higgs commented on HDFS-12090:
-----------------------------------
[~virajith], [~rakeshr] thank you both for your feedback!
{quote}When supporting ordered operations, I think {{SyncServiceSatisfier}}
needs to coordinate them (so that Datanodes don't start having additional
coordination).
{quote}
Yes. The SPS has since moved to use a service that is either inside the NN or
sits alongside it. The same approach could be used here.
{quote}However, in the future (once SPS converges), it would be good to look at
how this work can plug into/reuse parts of the SPS/refactor parts of SPS if
necessary. I would hate to have two parallel code paths that do something very
similar (satisfy storage policies).
{quote}
Agreed. The SPS underwent considerable changes in the past few months, so it
would not have been possible to track it while it was still in flux.
{quote}The data backup path is concerning. It bypasses the DN write path and
one Datanode backs up a whole file (in
{{SyncServiceSatisfierWorker#backupFile}})
{quote}
Agreed. It's possible to build a multipart-multinode uploader. This would have
three parts: {{init}}, {{putPart}}, and {{complete}}. The {{init}} and
{{complete}} take place in the task tracker, as they are merely metadata
operations and have a strict ordering ({{init}} first, {{complete}} only after
all the parts are done). The {{putPart}} calls need to take place on the DNs
but can be done in any order, in parallel.
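A minimal sketch of the shape this could take; none of these types exist in
HDFS today, they only illustrate the three-phase split described above:
{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.util.Map;
import org.apache.hadoop.fs.Path;

// Hypothetical interface illustrating the init / putPart / complete split;
// every type here is made up for illustration.
public interface MultinodeMultipartUploader {

  /** Opaque, store-specific handle for an in-progress upload. */
  interface UploadHandle {}

  /** Opaque, store-specific handle for one uploaded part. */
  interface PartHandle {}

  // Phase 1, on the coordinator (task tracker): metadata-only, must
  // complete before any part is uploaded.
  UploadHandle init(Path file) throws IOException;

  // Phase 2, on the Datanodes: parts may be uploaded in any order and
  // in parallel.
  PartHandle putPart(UploadHandle upload, int partNumber,
      InputStream data, long length) throws IOException;

  // Phase 3, back on the coordinator: metadata-only, issued only after
  // every putPart has succeeded.
  void complete(UploadHandle upload, Map<Integer, PartHandle> parts)
      throws IOException;
}
{code}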
{quote}Need for a throttling mechanism so as to limit the load on the NN.
{quote}
Agreed.
{quote}FYI, we have moved the C-DN logic to Namenode and now
{{StoragePolicySatisfier}} is doing the file level co-ordination(tracking the
file blocks)
{quote}
This is great news. I'm going through the SPS code now. If I understand
correctly, we will want to integrate our work by using the same scanner and the
same task scheduler. The two main interfaces we need to work with are
{{scanAndCollectFileIds}} and {{submitMoveTask}}. These are very generic, so we
could potentially implement them, but it's not yet clear whether they cover
everything we need.
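To make that concrete, something along these lines is what I have in mind. The
signatures and the {{SyncTask}} type below are assumptions for illustration,
not the actual SPS API:
{code:java}
import java.util.ArrayDeque;
import java.util.Queue;

// Rough sketch only: approximates the two SPS hooks named above with
// assumed signatures; SyncTask is a hypothetical type.
class SyncServiceAdapter {

  /** Hypothetical unit of work: write one file to the PROVIDED store. */
  static final class SyncTask {
    final long fileId;
    SyncTask(long fileId) {
      this.fileId = fileId;
    }
  }

  private final Queue<SyncTask> pendingTasks = new ArrayDeque<>();

  // Analogue of scanAndCollectFileIds: instead of scanning for files with
  // unsatisfied storage policies, collect the ids of files that changed
  // (e.g. from a snapshot diff) and queue a write task for each.
  void scanAndCollectFileIds(Iterable<Long> changedFileIds) {
    for (long fileId : changedFileIds) {
      pendingTasks.add(new SyncTask(fileId));
    }
  }

  // Analogue of submitMoveTask: the "movement" here is an upload of file
  // data to the external store rather than a block move between DNs.
  void submitMoveTask(SyncTask task) {
    // A real implementation would dispatch the task to a Datanode here;
    // this stub only marks where that dispatch would happen.
  }
}
{code}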
If we are to integrate, then there would need to be some changes to the SPS.
You have already changed it to work at the file level. The other three issues I
currently see are:
* File writing, like block movement, can always be done in parallel from the
DNs. However, the metadata operations from HDFS-12090 need to be ordered: e.g.
deletes must happen before creates; directories must be created before we try
to place files in them; renames can happen before a directory deletion or
after a directory creation. Keep in mind that when we perform a
multipart-multinode upload, the multipart {{init}} and {{complete}} also need
to be ordered, but I think we can do them from the tracker.
* Are there hooks in the tracker for multipart-multinode uploads to perform
the {{init}} and {{complete}}?
* HDFS-12090 requires knowledge of deleted files and directories, as well as
renames, so, if I've understood correctly, we won't be able to share a scanner
unless the SPS also uses a snapshot-based approach. I'm not sure whether this
would mean the Context API needs to change or whether it can be done with the
existing APIs, since {{ExternalSPSFileIDCollector}} has access to a
{{DistributedFileSystem}}, which can take the snapshots and provide the diffs
(see the sketch after this list).
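For example, a snapshot-diff driven scan through that {{DistributedFileSystem}}
handle could look roughly like this; the directory and snapshot names are
placeholders:
{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.SnapshotDiffReport;
import org.apache.hadoop.hdfs.protocol.SnapshotDiffReport.DiffReportEntry;

// Sketch of a snapshot-diff based scan. Assumes an admin has already made
// the directory snapshottable (dfs.allowSnapshot) and that a "sync-1"
// snapshot exists from the previous pass.
public class SnapshotDiffScan {
  public static void scan(DistributedFileSystem dfs) throws IOException {
    Path dir = new Path("/synced-dir");

    // Take a new snapshot and diff it against the previous one.
    dfs.createSnapshot(dir, "sync-2");
    SnapshotDiffReport report =
        dfs.getSnapshotDiffReport(dir, "sync-1", "sync-2");

    // Each entry is a CREATE, MODIFY, DELETE or RENAME; HDFS-12090 would
    // need to apply deletes before creates and order renames around
    // directory creation/deletion, as described above.
    for (DiffReportEntry entry : report.getDiffList()) {
      String path = new String(entry.getSourcePath(), StandardCharsets.UTF_8);
      System.out.println(entry.getType() + " " + path);
    }
  }
}
{code}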
As I said, I'm still going through the SPS code, so please let me know if I've
understood correctly.
> Handling writes from HDFS to Provided storages
> ----------------------------------------------
>
> Key: HDFS-12090
> URL: https://issues.apache.org/jira/browse/HDFS-12090
> Project: Hadoop HDFS
> Issue Type: New Feature
> Reporter: Virajith Jalaparti
> Priority: Major
> Attachments: HDFS-12090-Functional-Specification.001.pdf,
> HDFS-12090-Functional-Specification.002.pdf,
> HDFS-12090-Functional-Specification.003.pdf, HDFS-12090-design.001.pdf,
> HDFS-12090.0000.patch, HDFS-12090.0001.patch
>
>
> HDFS-9806 introduces the concept of {{PROVIDED}} storage, which makes data in
> external storage systems accessible through HDFS. However, HDFS-9806 is
> limited to data being read through HDFS. This JIRA will deal with how data
> can be written to such {{PROVIDED}} storages from HDFS.