[ https://issues.apache.org/jira/browse/HDFS-12090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16339082#comment-16339082 ]

Ewan Higgs commented on HDFS-12090:
-----------------------------------

[~virajith], [~rakeshr] thank you both for your feedback!
{quote}When supporting ordered operations, I think {{SyncServiceSatisfier}} 
needs to coordinate them (so that Datanodes don't start having additional 
coordination).
{quote}
Yes. The SPS has since moved to use a service that is either inside the NN or 
sits alongside it. The same approach could be used here.
{quote}However, in the future (once SPS converges), it would be good to look at 
how this work can plug into/reuse parts of the SPS/refactor parts of SPS if 
necessary. I would hate to have two parallel code paths that do something very 
similar (satisfy storage policies).
{quote}
Agreed. The SPS has undergone considerable changes over the past few months, 
so it has not been practical to track it closely until now.
{quote}The data backup path is concerning. It bypasses the DN write path and 
one Datanode backs up a whole file (in 
{{SyncServiceSatisfierWorker#backupFile}})
{quote}
Agreed. It's possible to make a Multipart-Multinode uploader with three parts: 
init, putPart, and complete. The init and complete run in the task tracker, 
since they are purely metadata operations with a fixed ordering (init first, 
complete only after all the parts are done). The putPart calls need to happen 
on the DNs but can run in any order, in parallel.
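
To make the three-phase split concrete, here is a minimal sketch of what such 
an uploader interface might look like. Every name and signature in it is an 
assumption for illustration, not an existing API:
{code:java}
import java.io.InputStream;
import java.util.Map;

/**
 * Illustrative sketch only: none of these names or signatures exist yet;
 * they just capture the init / putPart / complete split described above.
 */
interface MultipartMultinodeUploader {
  /** Opaque handles minted by the remote store. */
  interface UploadHandle {}
  interface PartHandle {}

  /** Metadata-only; runs in the tracker, strictly before any putPart. */
  UploadHandle initUpload(String targetPath);

  /** Data transfer; runs on the DNs, in any order and in parallel. */
  PartHandle putPart(UploadHandle upload, int partNumber,
                     InputStream data, long length);

  /** Metadata-only; runs in the tracker, only after every part is done. */
  void completeUpload(UploadHandle upload, Map<Integer, PartHandle> parts);
}
{code}
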
{quote}Need for a throttling mechanism so as to limit the load on the NN.
{quote}
Agreed.
{quote}FYI, we have moved the C-DN logic to Namenode and now 
{{StoragePolicySatisfier}} is doing the file-level co-ordination (tracking the 
file blocks)
{quote}
This is great news. I'm going through the SPS code now. If I understand 
correctly, we will want to integrate our work by using the same scanner and the 
same task scheduler. The two main interfaces we need to work with are 
{{scanAndCollectFileIds}} and {{submitMoveTask}}. These are very generic, so we 
could potentially implement them, but whether they fit is not entirely clear 
yet.
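
As a first sketch of that integration, the following is how the sync service 
might sit behind those two hooks. Both signatures are guesses from the names 
above, not the real SPS API:
{code:java}
/**
 * Hypothetical sketch of the sync service behind the two SPS hooks named
 * above; both signatures are guesses, not the real SPS API.
 */
class BlockMovingInfo {}  // stand-in for the SPS block-move descriptor

interface SyncServiceSpsAdapter {
  /** Scanner side: enumerate the file ids under a root that need syncing. */
  void scanAndCollectFileIds(long rootINodeId);

  /** Scheduler side: reinterpret a "move" as a write to the PROVIDED store. */
  void submitMoveTask(BlockMovingInfo blockMovingInfo);
}
{code}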

If we are to integrate, some changes to the SPS would be needed. You have 
already changed it to work at the file level. The other three issues I 
currently see are:
 * File writing, like block movements, can always be done in parallel from the 
DNs. However, the metadata operations from HDFS-12090 need to be ordered: 
deletes must happen before creates; directory creation must take place before 
we try to place files in those directories; renames can happen before a 
directory deletion or after a directory creation. Keep in mind that when we 
perform a multipart-multinode upload, the multipart init and complete also need 
to be ordered, but I think we can do them from the tracker.
 * Are there hooks in the tracker for multinode-multipart uploads to perform 
the init and complete?
 * HDFS-12090 requires knowledge of deleted files and directories as well as 
renames, so, if I've understood correctly, we won't be able to share a scanner 
unless the SPS also uses a snapshot-based approach. I'm not sure whether this 
would mean the Context API has to change or whether it can be done with the 
existing APIs, since {{ExternalSPSFileIDCollector}} has access to a 
{{DistributedFileSystem}}, which can perform the snapshotting and provide 
diffs (a minimal sketch of such a diff scan follows this list).
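
To illustrate that last point, here is a minimal sketch of a snapshot-diff 
scan. It assumes the collector can reach a {{DistributedFileSystem}} whose 
root is already snapshottable; {{createSnapshot}} and 
{{getSnapshotDiffReport}} are existing DFS calls, while the class and snapshot 
names are illustrative:
{code:java}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.SnapshotDiffReport;
import org.apache.hadoop.hdfs.protocol.SnapshotDiffReport.DiffReportEntry;

/**
 * Minimal sketch of a snapshot-diff scan, assuming the root directory is
 * already snapshottable and "sync-s0" marks the previous sync point.
 */
class SnapshotDiffScan {
  static void collectChanges(DistributedFileSystem dfs, Path root)
      throws Exception {
    dfs.createSnapshot(root, "sync-s1");  // capture the current state
    // Everything that changed since the last sync point:
    SnapshotDiffReport report =
        dfs.getSnapshotDiffReport(root, "sync-s0", "sync-s1");
    for (DiffReportEntry entry : report.getDiffList()) {
      switch (entry.getType()) {
        case DELETE: /* queue delete: must precede creates */       break;
        case RENAME: /* queue rename, ordered around dir ops */     break;
        case CREATE: /* queue create: directories before files */   break;
        case MODIFY: /* queue re-upload of the file's new blocks */ break;
      }
    }
  }
}
{code}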

As I said, I'm still going through the SPS code, so please let me know if I've 
understood correctly.

> Handling writes from HDFS to Provided storages
> ----------------------------------------------
>
>                 Key: HDFS-12090
>                 URL: https://issues.apache.org/jira/browse/HDFS-12090
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Virajith Jalaparti
>            Priority: Major
>         Attachments: HDFS-12090-Functional-Specification.001.pdf, 
> HDFS-12090-Functional-Specification.002.pdf, 
> HDFS-12090-Functional-Specification.003.pdf, HDFS-12090-design.001.pdf, 
> HDFS-12090.0000.patch, HDFS-12090.0001.patch
>
>
> HDFS-9806 introduces the concept of {{PROVIDED}} storage, which makes data in 
> external storage systems accessible through HDFS. However, HDFS-9806 is 
> limited to data being read through HDFS. This JIRA will deal with how data 
> can be written to such {{PROVIDED}} storages from HDFS.


