[ https://issues.apache.org/jira/browse/HDFS-12090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16319396#comment-16319396 ]
Virajith Jalaparti commented on HDFS-12090:
-------------------------------------------

Thanks for posting the patch [~ehiggs]. Here are some initial thoughts on the high-level design:
# As you note, the current implementation doesn't support ordered operations (e.g., backup of a directory hierarchy to another instance of HDFS). All the operations in a particular snapshot diff happen in parallel across (potentially) multiple datanodes. When supporting ordered operations, I think {{SyncServiceSatisfier}} needs to coordinate them (so that Datanodes don't have to take on additional coordination). So, the design should make sure that some part of it is capable of handling ordered operations. Having an abstract class that performs the functions handled in {{SyncServiceSatisfier#synchronizeBackupMount}} can be one way to solve this issue.
# The data backup path is concerning. It bypasses the DN write path, and one Datanode backs up a whole file (in {{SyncServiceSatisfierWorker#backupFile}}): it copies blocks from other datanodes in the cluster and then writes them back to the provided store. Compared to the SPS approach (where a DN could be responsible for only 1 block), this approach involves 2 network transfers instead of 1 (the DN has to copy blocks from other DNs and then write them back to the provided store), and cannot benefit from the parallelism of each DN handling one or a few blocks of the file.
# The patch seems to take a completely separate path from the SPS work (HDFS-10285). Given that the SPS is still in a state of flux, this is OK for now. However, in the future (once SPS converges), it would be good to look at how this work can plug into or reuse parts of the SPS, refactoring parts of the SPS if necessary. I would hate to have two parallel code paths that do something very similar (satisfy storage policies). That said, I think that shouldn't stop progress on this JIRA.
# Need for a throttling mechanism so as to limit the load on the NN.
Although not immediate, this will eventually be required.

Some comments specific to this patch:
* In {{SyncTaskScheduler#schedule}}, why have these two separate paths?
{code}
if (syncTask.operation == SyncTask.Operation.CREATE_FILE) {
  scheduleOnFirstBlockDatanode(syncTask);
} else {
  scheduleOnRandomDatanode(syncTask);
}
{code}
* Use a builder pattern for creating {{SyncTask}}?
* Why use the term "sync mount" and not "backup endpoint"? The latter was the terminology used in the latest functional spec.
* The method names {{createSync}} and {{removeSync}}, though understandable, are confusing. I think {{createBackupEndPoint}}, {{removeBackupEndPoint}}, etc. would be easier to understand (and would adhere to the functional spec).

> Handling writes from HDFS to Provided storages
> ----------------------------------------------
>
>                 Key: HDFS-12090
>                 URL: https://issues.apache.org/jira/browse/HDFS-12090
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Virajith Jalaparti
>         Attachments: HDFS-12090-Functional-Specification.001.pdf,
> HDFS-12090-Functional-Specification.002.pdf,
> HDFS-12090-Functional-Specification.003.pdf, HDFS-12090-design.001.pdf,
> HDFS-12090.0000.patch
>
>
> HDFS-9806 introduces the concept of {{PROVIDED}} storage, which makes data in
> external storage systems accessible through HDFS. However, HDFS-9806 is
> limited to data being read through HDFS. This JIRA will deal with how data
> can be written to such {{PROVIDED}} storages from HDFS.
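
To make the builder suggestion for {{SyncTask}} a bit more concrete, here is a rough sketch of what it could look like. Only the {{operation}} field and {{Operation.CREATE_FILE}} are taken from the snippet quoted above; the class name {{SyncTaskSketch}}, the {{DELETE_FILE}} constant, and the {{sourcePath}} and {{backupEndpoint}} fields are assumptions made for illustration and are not the patch's actual API.

```java
import java.util.Objects;

/**
 * Hypothetical stand-in for the patch's SyncTask, showing a builder.
 * Only `operation` and CREATE_FILE appear in the reviewed code; every
 * other name here is an assumption for illustration.
 */
final class SyncTaskSketch {
  enum Operation { CREATE_FILE, DELETE_FILE } // DELETE_FILE is assumed

  private final Operation operation;
  private final String sourcePath;      // assumed field
  private final String backupEndpoint;  // assumed field

  private SyncTaskSketch(Builder b) {
    // Validate once, at build time, so a task can never be half-initialized.
    this.operation = Objects.requireNonNull(b.operation, "operation");
    this.sourcePath = Objects.requireNonNull(b.sourcePath, "sourcePath");
    this.backupEndpoint = Objects.requireNonNull(b.backupEndpoint, "backupEndpoint");
  }

  static Builder builder() {
    return new Builder();
  }

  Operation getOperation() { return operation; }
  String getSourcePath() { return sourcePath; }
  String getBackupEndpoint() { return backupEndpoint; }

  static final class Builder {
    private Operation operation;
    private String sourcePath;
    private String backupEndpoint;

    private Builder() {}

    Builder operation(Operation op) { this.operation = op; return this; }
    Builder sourcePath(String path) { this.sourcePath = path; return this; }
    Builder backupEndpoint(String endpoint) { this.backupEndpoint = endpoint; return this; }

    // Throws NullPointerException if a required field was never set.
    SyncTaskSketch build() { return new SyncTaskSketch(this); }
  }
}
```

A caller would then write something like {{SyncTaskSketch.builder().operation(Operation.CREATE_FILE).sourcePath(...).backupEndpoint(...).build()}}, which keeps the task immutable and makes it easier to add optional fields later without multiplying constructors.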