[ 
https://issues.apache.org/jira/browse/HDFS-9820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15242030#comment-15242030
 ] 

Yongjun Zhang commented on HDFS-9820:
-------------------------------------

Many thanks [~jingzhao]. Good discussion!

1.
{quote}
No. This is incorrect. We allow distcp -diff s1 .. "s2" can be done after the 
copy. See TestDistCpSync#testSyncWithCurrent as an example
{quote}
If some changes are made while distcp is running, or after it finishes but 
before s2 is created, then what was copied is not exactly the content of s2. 
Right?

2.
{quote}
This assumption must be verified before the new distcp. Currently we do a 
snapshot diff report on target (between from and ".") to check. This check 
cannot be dropped as in your current patch.
{quote}
I certainly agree that we should do the checking. I emphasized assumptions I 
and II in my last comment. However, since the checking can only be done at the 
beginning of distcp, any changes made after the check but before s2 is created 
will be missed. So I think we need to document that no change should be made 
while we do this operation.

3.
{quote}
I mean "" or "." should never be used as the fromState in distcp -diff command, 
otherwise we have no way to verify there is no change happening on target. So 
we actually should use "s2" here.
{quote}
Then, in the case HDFS-9820 tries to solve, are you suggesting that we first 
create a snapshot s2 (for the sake of doing a check) before reverting the 
target back to s1? The issue described in #2 above also applies here.

4.
{quote}
This is also wrong. In command line "." is the alias of the current state.
{quote}
I saw that distcp was using {{""}}; maybe we should change it to consistently use {{"."}}.

5.
{quote}
For any modification/creation happening under a renamed directory, the diff 
report always uses the paths before the rename (as reported by HDFS-10263). 
prepareDiffList changes these paths to new paths after the rename, but when 
applying the reverse diff, we do not need to do this.
{quote}
Renaming x in s1 to y in s2 means that x is the original name before the 
rename, as reported in snapshotDiff(s1, s2), where s1 is fromSS and s2 is toSS. 
When we look at the reversion, the rename operation becomes renaming y in s2 to 
x in s1, so y should be the original name before the rename. That is what I 
expect to see in the report of snapshotDiff(s2, s1), where s2 is fromSS and 
s1 is toSS.

However, snapshotDiff(s2, s1) still uses the names in s1 as the original name 
(x in this case; I really expect it to be y), though it does swap the order of 
the operands compared with snapshotDiff(s1, s2). This is the issue I reported 
in HDFS-10263. You can see some examples there.

Basically I expect snapshotDiff(fromSS, toSS) to use names in fromSS. In the 
reversion case, it's the "." state. This is the symmetry I was referring to.
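
To make the expected symmetry concrete, here is a minimal Java sketch. It is 
only a model of what I expect, not the actual snapshot diff code: 
RenameEntry and SymmetricDiffSketch are hypothetical names, and the real 
report format may differ.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for one rename entry of a diff report; not Hadoop's API. */
final class RenameEntry {
    final String source; // name in the "from" state (fromSS)
    final String target; // name in the "to" state (toSS)

    RenameEntry(String source, String target) {
        this.source = source;
        this.target = target;
    }

    @Override
    public String toString() {
        return "R " + source + " -> " + target;
    }
}

final class SymmetricDiffSketch {
    /**
     * Under the expected symmetry, the report for (s2, s1) is the report for
     * (s1, s2) with source and target of every rename entry swapped.
     */
    static List<RenameEntry> reverse(List<RenameEntry> forward) {
        List<RenameEntry> reversed = new ArrayList<>();
        for (RenameEntry e : forward) {
            reversed.add(new RenameEntry(e.target, e.source));
        }
        return reversed;
    }

    public static void main(String[] args) {
        List<RenameEntry> s1ToS2 = new ArrayList<>();
        s1ToS2.add(new RenameEntry("x", "y")); // x in s1 renamed to y in s2
        System.out.println(reverse(s1ToS2));   // the reversed entry names y as the source
    }
}
```

In this model the reversed entry uses y (the name in fromSS) as the original 
name, which is the behavior I was expecting from snapshotDiff(s2, s1).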

Does this explanation make sense?

Thanks again!


> Improve distcp to support efficient restore to an earlier snapshot
> ------------------------------------------------------------------
>
>                 Key: HDFS-9820
>                 URL: https://issues.apache.org/jira/browse/HDFS-9820
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: distcp
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>         Attachments: HDFS-9820.001.patch, HDFS-9820.002.patch, 
> HDFS-9820.003.patch, HDFS-9820.004.patch
>
>
> HDFS-4167 intends to restore HDFS to the most recent snapshot, and it 
> involves some complexity and challenges. 
> HDFS-7535 improved distcp performance by avoiding copying files that changed 
> name since last backup.
> On top of HDFS-7535, HDFS-8828 improved distcp performance when copying data 
> from source to target cluster, by only copying changed files since last 
> backup. It works by using snapshot diff to find all changed files and 
> copying only those files.
> See 
> https://blog.cloudera.com/blog/2015/12/distcp-performance-improvements-in-apache-hadoop/
> This jira proposes a variation of HDFS-8828: find the files changed in the 
> target cluster since the last snapshot sx, and copy these from the source 
> cluster's same snapshot sx, to restore the target cluster to sx.
> If a file/dir is
> - renamed, rename it back
> - created in target cluster, delete it
> - modified, add it to the copy list
> Then run distcp with the copy list, copying from the source cluster's 
> corresponding snapshot.
> This could be a new command line switch -rdiff in distcp.
> HDFS-4167 would still be nice to have. It just seems to me that HDFS-9820 
> would hopefully be easier to implement.
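
The per-file rules in the description above can be sketched roughly as 
follows. This is only an illustration under assumed names: ChangeKind, 
TargetChange and RestorePlanSketch are hypothetical and are not part of 
distcp or of any attached patch, and the sketch only builds the plan without 
touching a file system.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical change kinds taken from the description; not Hadoop's diff types. */
enum ChangeKind { RENAMED, CREATED, MODIFIED }

/** One change on the target cluster since snapshot sx (illustrative model only). */
final class TargetChange {
    final ChangeKind kind;
    final String pathInSx;    // path as of snapshot sx; null for CREATED
    final String currentPath; // path in the current target state

    TargetChange(ChangeKind kind, String pathInSx, String currentPath) {
        this.kind = kind;
        this.pathInSx = pathInSx;
        this.currentPath = currentPath;
    }
}

final class RestorePlanSketch {
    final List<String[]> renamesBack = new ArrayList<>(); // {currentPath, pathInSx}
    final List<String> deletes = new ArrayList<>();       // created on target, to delete
    final List<String> copyList = new ArrayList<>();      // to copy from source snapshot sx

    /** Sorts each change into one of the three actions named in the description. */
    static RestorePlanSketch plan(List<TargetChange> changes) {
        RestorePlanSketch p = new RestorePlanSketch();
        for (TargetChange c : changes) {
            switch (c.kind) {
                case RENAMED:
                    p.renamesBack.add(new String[] {c.currentPath, c.pathInSx});
                    break;
                case CREATED:
                    p.deletes.add(c.currentPath);
                    break;
                case MODIFIED:
                    p.copyList.add(c.pathInSx);
                    break;
            }
        }
        return p;
    }
}
```

The resulting copyList would then be handed to distcp, which copies those 
paths from the source cluster's corresponding snapshot.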



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
