[ 
https://issues.apache.org/jira/browse/HDFS-7535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-7535:
----------------------------
    Attachment: HDFS-7535.003.patch

Thanks for the review, Nicholas! Updated the patch to address your comments.

bq. In DistCpSync.moveToTmpDir, why move the paths to tmp for the delete 
operations?

I'm thinking about whether we can support "undo" for this functionality in 
the future. I.e., if the user hits any issue while applying the diff and we 
have moved all the files/dirs to the tmp dir instead of deleting them, we 
still have a chance to undo all the changes.
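
A minimal sketch of the undo idea described above, assuming local files as a stand-in for HDFS paths. The class {{StagedDelete}} and its methods are hypothetical names for illustration and are not the code in the patch; the sketch also ignores failures that happen in the middle of an undo.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration only: local files stand in for HDFS paths,
// and none of these names come from the patch.
final class StagedDelete {
    private final Path tmpDir;
    // Each entry is {originalPath, parkedPath}; the most recent move is on top.
    private final Deque<Path[]> moves = new ArrayDeque<>();
    private int counter = 0;

    StagedDelete(Path tmpDir) {
        this.tmpDir = tmpDir;
    }

    // "Delete" a path by renaming it into the tmp dir instead of removing it.
    void stage(Path target) throws IOException {
        // The counter prefix keeps parked names unique inside the tmp dir.
        Path parked = tmpDir.resolve((counter++) + "-" + target.getFileName());
        Files.move(target, parked);
        moves.push(new Path[] {target, parked});
    }

    // Undo: move every parked path back, most recent first.
    void undo() throws IOException {
        while (!moves.isEmpty()) {
            Path[] m = moves.pop();
            Files.move(m[1], m[0]);
        }
    }

    // Commit: the sync succeeded, so remove the parked paths for real.
    void commit() throws IOException {
        while (!moves.isEmpty()) {
            deleteRecursively(moves.pop()[1]);
        }
    }

    private static void deleteRecursively(Path root) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order puts children before their parent directories.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("staged-delete-demo");
        Path tmp = Files.createDirectory(root.resolve(".tmp"));
        Path file = Files.write(root.resolve("a.txt"), "data".getBytes());

        StagedDelete staged = new StagedDelete(tmp);
        staged.stage(file);   // a.txt now lives under .tmp
        staged.undo();        // a.txt is back in its original place
        System.out.println(Files.exists(file));
    }
}
```

The point of the design is that a rename into the tmp dir is cheap and reversible, while a real delete is not, so the destructive step can be deferred until the whole diff has been applied.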

bq. Would it be able to preserve other attributes for the "-p" option?

Attribute preservation is handled later in the CopyMapper, which calls 
{{DistCpUtils#preserve}}. I will run some system tests and maybe add a new 
unit test to verify.

bq. Is it better to throw an exception instead since the user may not want to 
fall back?

My current concern is that if this functionality is used by applications like 
Falcon and Oozie, it may be more convenient to include the fallback logic 
inside distcp. If we throw exceptions directly, these applications need to be 
able to change the options themselves to avoid using snapshot diff.
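
The trade-off above can be sketched roughly as follows. This is only an illustration: {{SyncWithFallback}}, its {{strict}} flag, and the two callbacks are hypothetical names, not part of the patch or of distcp's options.

```java
import java.util.function.BooleanSupplier;

// Hypothetical illustration only: these names are not from the patch.
final class SyncWithFallback {
    enum Mode { SNAPSHOT_DIFF, FULL_COPY }

    // diffSync returns false when the snapshot diff cannot be applied.
    // strict=true models the "throw an exception" alternative.
    static Mode run(BooleanSupplier diffSync, Runnable fullCopy, boolean strict) {
        if (diffSync.getAsBoolean()) {
            return Mode.SNAPSHOT_DIFF;
        }
        if (strict) {
            throw new IllegalStateException("snapshot diff sync could not be applied");
        }
        // Fall back inside the tool so callers do not have to change options.
        fullCopy.run();
        return Mode.FULL_COPY;
    }

    public static void main(String[] args) {
        Mode mode = run(() -> false, () -> System.out.println("running full copy"), false);
        System.out.println(mode);
    }
}
```

With the fallback inside the tool, a scheduler can keep submitting the same job definition; with the strict variant, the caller has to catch the failure and resubmit with different options.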

> Utilize Snapshot diff report for distcp
> ---------------------------------------
>
>                 Key: HDFS-7535
>                 URL: https://issues.apache.org/jira/browse/HDFS-7535
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: distcp, snapshots
>            Reporter: Jing Zhao
>            Assignee: Jing Zhao
>         Attachments: HDFS-7535.000.patch, HDFS-7535.001.patch, 
> HDFS-7535.002.patch, HDFS-7535.003.patch
>
>
> Currently the HDFS snapshot diff report can identify file/directory 
> creation, deletion, rename and modification under a snapshottable directory. 
> We can use the diff report for distcp between the primary cluster and a 
> backup cluster to avoid unnecessary data copy. This is especially useful 
> when a big directory is renamed in the primary cluster: the current distcp 
> cannot detect the rename op, so the rename usually leads to a large amount 
> of real data copy.
> More details of the approach will come in the first comment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
