[
https://issues.apache.org/jira/browse/HDFS-9820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543686#comment-15543686
]
Andrew Wang commented on HDFS-9820:
-----------------------------------
Hi Yongjun, thanks for sticking with this one. I had a few meta comments, and
some code comments:
Meta comments:
* I see a TODO about HDFS-10263. How much would it simplify the code here if we
implemented that first? There's a lot of logic about source vs. target due to
how we flip it for the rdiff, and overall not much code sharing with the
existing snapshot diff mechanism.
* Is there a situation where we'd want to pass two different clusters for "src"
and "tgt" to rdiff? They need to both have identical "s1" base states anyway.
Code comments:
* Comment in DistCpOptions says that forward diff is referred to as "Fdiff" but
it's still just "diff" in multiple places. Maybe change to say "referred to as
diff or Fdiff"?
* Unused Preconditions import in DistCpOptions
DistCpSync:
* Extra "//" line on diffMap. I'd also prefer if we made this a javadoc block
comment ({{/** */}}).
* {{preSyncCheck}} could use some explanatory comments about how we flip the
diff for rdiff. I also don't know what the "c" in "cfs" stands for, so that
could use a comment as well. Also, is there a way to additionally dedupe the
rdiff/diff checking on the modtime? If not, cfs/cdir aren't saving us much and
might as well just put them in the corresponding part of the if/else block.
* We lost this comment in sync:
{noformat}
// TODO: since we have tmp directory, we can support "undo" with failures
// set the source path using the snapshot path
{noformat}
* In getRenameAndDeleteDiffsRdiff, I think we mean "reversal" rather than
"reversion", and "Reversed" rather than "Reverted" in
{{renameDiffsListReverted}}
* TestDistCpSync: what happened to testFallback? New syncAndFail also isn't
used.
I didn't go through all the new tests, will get to that later.
> Improve distcp to support efficient restore to an earlier snapshot
> ------------------------------------------------------------------
>
> Key: HDFS-9820
> URL: https://issues.apache.org/jira/browse/HDFS-9820
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: distcp
> Reporter: Yongjun Zhang
> Assignee: Yongjun Zhang
> Attachments: HDFS-9820.001.patch, HDFS-9820.002.patch,
> HDFS-9820.003.patch, HDFS-9820.004.patch, HDFS-9820.005.patch
>
>
> A common use scenario (scenario 1):
> # create snapshot sx in clusterX,
> # do some experiments in clusterX, which create some files,
> # throw away the changed files and go back to sx.
> In another scenario (scenario 2), there is a production cluster and a backup
> cluster, and we periodically sync the data from the production cluster to
> the backup cluster with distcp.
> The cluster in scenario 1 could be the backup cluster in scenario 2.
> For scenario 1:
> HDFS-4167 intends to restore HDFS to the most recent snapshot, but it
> involves some complexity and challenges. Until that jira is implemented, we
> count on distcp to copy from the snapshot to the current state. However, the
> performance of this operation can be very bad because we have to go through
> all files even if only a few files changed.
> For scenario 2:
> HDFS-7535 improved distcp performance by avoiding copying files that were
> only renamed since the last backup.
> On top of HDFS-7535, HDFS-8828 improved distcp performance when copying data
> from the source to the target cluster, by copying only the files changed
> since the last backup. It works by using a snapshot diff to find all changed
> files and copying only those.
> See
> https://blog.cloudera.com/blog/2015/12/distcp-performance-improvements-in-apache-hadoop/
> This jira proposes a variation of HDFS-8828: find the files changed in the
> target cluster since the last snapshot sx, and copy them from snapshot sx of
> either the source or the target cluster, to restore the target cluster's
> current state to sx.
> Specifically, if a file/dir was
> - renamed, rename it back
> - created in the target cluster, delete it
> - modified, put it on the copy list
> Then run distcp with the copy list, copying from the source cluster's
> corresponding snapshot.
> This could be a new command line switch -rdiff in distcp.
> As a native restore feature, HDFS-4167 would still be ideal to have. However,
> HDFS-9820 would hopefully be easier to implement, before HDFS-4167 is in
> place.
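For reference while reviewing, the restore steps in the issue description can
be sketched roughly as below. This is only an illustration under stated
assumptions, not the code in the patch: RdiffPlanSketch, ChangeType, DiffEntry
and RestorePlan are hypothetical names, and the handling of entries deleted
since sx (copy them back from sx) is an assumption, since the description does
not list that case.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: every type and name here is hypothetical, not DistCp's code. */
final class RdiffPlanSketch {

    /** Simplified change kinds in a diff from snapshot sx to the current state. */
    enum ChangeType { CREATE, DELETE, MODIFY, RENAME }

    /** One diff entry. For RENAME, path is the name in sx, renamedTo the current name. */
    static final class DiffEntry {
        final ChangeType type;
        final String path;
        final String renamedTo; // null unless type is RENAME

        DiffEntry(ChangeType type, String path, String renamedTo) {
            this.type = type;
            this.path = path;
            this.renamedTo = renamedTo;
        }
    }

    /** The three kinds of work needed to bring the target back to sx. */
    static final class RestorePlan {
        final List<String[]> renamesBack = new ArrayList<>(); // {currentName, nameInSx}
        final List<String> deletes = new ArrayList<>();
        final List<String> copyList = new ArrayList<>();
    }

    /** Classifies each change made since sx into the action that undoes it. */
    static RestorePlan plan(List<DiffEntry> diffSinceSx) {
        RestorePlan plan = new RestorePlan();
        for (DiffEntry e : diffSinceSx) {
            switch (e.type) {
                case RENAME:
                    // Renamed since sx: rename it back to its name in sx.
                    plan.renamesBack.add(new String[] {e.renamedTo, e.path});
                    break;
                case CREATE:
                    // Created since sx: it does not exist in sx, so delete it.
                    plan.deletes.add(e.path);
                    break;
                case MODIFY:
                    // Modified since sx: copy the sx version over it.
                    plan.copyList.add(e.path);
                    break;
                case DELETE:
                    // Assumption, not stated in the description: a file deleted
                    // since sx has to be copied back from sx.
                    plan.copyList.add(e.path);
                    break;
            }
        }
        return plan;
    }
}
```

Keeping the three lists separate mirrors the order in the description: undo
the renames and creations first, then hand only the copy list to distcp.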