[
https://issues.apache.org/jira/browse/HADOOP-14137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892319#comment-15892319
]
Steve Loughran commented on HADOOP-14137:
-----------------------------------------
A few more thoughts:
# The listing should be randomized before the copy begins. We've seen
performance benefits there, related to hotspots in object stores.
# If distcp could also generate a listing file of everything it copies,
including the path at the far end and the checksums of both source and dest,
then it could be used for incremental copying between any two filesystems, each
of which supports any checksum mechanism, even when the two filesystems use
different ones. Instead of verifying that dest checksum == src checksum, we
could verify that src == cached source checksum and that dest == cached dest
value. Any difference would trigger a copy.
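To make the two suggestions concrete, here is a minimal sketch in Python rather than distcp's Java. Everything in it is hypothetical: ManifestEntry, shuffled_listing and needs_copy are illustrative names, not existing distcp classes or options, and the checksum strings stand in for whatever each filesystem reports.

```python
import random
from typing import NamedTuple


class ManifestEntry(NamedTuple):
    """Checksums recorded for one file at its last copy (hypothetical format)."""
    src_checksum: str
    dest_checksum: str


def shuffled_listing(paths, seed=None):
    """Return the copy listing in random order, leaving the input untouched.

    Randomizing spreads consecutive copies across the key space instead of
    hitting one prefix of the object store at a time.
    """
    shuffled = list(paths)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def needs_copy(cached, src_checksum, dest_checksum):
    """Decide whether a file has to be copied again.

    Each side is compared only with its own cached value, so the source and
    destination are free to use different checksum algorithms.
    """
    if cached is None or dest_checksum is None:
        return True  # never copied before, or missing at the destination
    return (src_checksum != cached.src_checksum
            or dest_checksum != cached.dest_checksum)
```

The point of needs_copy is that the source and destination checksums are never compared with each other, only with what was recorded at the previous copy.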
> Faster distcp by taking file list from fsimage or -lsr result
> -------------------------------------------------------------
>
> Key: HADOOP-14137
> URL: https://issues.apache.org/jira/browse/HADOOP-14137
> Project: Hadoop Common
> Issue Type: New Feature
> Components: tools/distcp
> Reporter: Zheng Shao
>
> DistCp is very slow to start when the src directory has a huge number of
> subdirectories. In our case, we already have the directory listing (via
> "hdfs oiv -i fsimage" or via nightly "hdfs dfs -ls -R /" dumps), and we would
> like to use that instead of doing realtime listing on the NameNode.
> The "-f" option doesn't help in this case because it would try to put
> everything into a single flat target directory.
> We'd like to introduce a new option "-list <file>" for distcp. The <file>
> contains the result of listing the src directory.
> In order to achieve this, we plan to:
> 1. Add a new CopyListing class PregeneratedCopyListing similar to
> SimpleCopyListing which doesn't "-ls -R" into the directory, but takes the
> listing via "-list"
> 2. Add an option "-list <file>" which will automatically make distcp use the
> new PregeneratedCopyListing class.
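As a rough illustration of what a pregenerated listing would have to provide, the sketch below (Python, not the proposed Java class) parses lines in the usual eight-column "hdfs dfs -ls -R" layout and maps each source path to a target path that keeps the directory structure, which is what "-f" does not do. The function names are hypothetical, and the column layout is an assumption about the dump format; fsimage output from "hdfs oiv" would need a different parser.

```python
import posixpath


def parse_ls_line(line):
    """Parse one recursive-listing line into (is_dir, size, path).

    Assumed columns: permissions, replication, owner, group, size, date,
    time, path. Returns None for lines that do not match, such as the
    "Found N items" header.
    """
    fields = line.split(None, 7)  # the path is last and may contain spaces
    if len(fields) < 8 or fields[0][0] not in "d-":
        return None
    return fields[0].startswith("d"), int(fields[4]), fields[7].rstrip("\n")


def target_path(src_root, dest_root, path):
    """Map a source path to the destination, preserving the relative layout."""
    rel = posixpath.relpath(path, src_root)
    return dest_root if rel == "." else posixpath.join(dest_root, rel)
```

For example, with a source root of /data and a destination root of /backup, the listed file /data/a/f.txt maps to /backup/a/f.txt rather than to a flat /backup/f.txt.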