[
https://issues.apache.org/jira/browse/HADOOP-14137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15924025#comment-15924025
]
Steve Loughran commented on HADOOP-14137:
-----------------------------------------
I'm going to give some rough comments, with the warning that I don't really know
distcp that well, as in: not well enough to vote up a patch. Someone competent in
the code needs to have a look.
* Not sure about "PregeneratedFileListing" as a term; it makes the code variables
and stuff look complex. How about "existing"?
* Without actually doing the plugin modules, could you set it up so that it
won't be too hard to plug in a new parser?
* If you switch the logger in {{SimpleCopyListing}} to SLF4J, you can log more
easily; I tend to do that as files get maintained
* If parseFsImageLineToFileStatus hits a problem, it just downgrades to a null.
That should be logged at debug to identify why there's a parse failure.
* There is some logging if null is returned (like 526), but that doesn't
indicate anything; it just runs the risk of a bad file printing lots of
"processed" log events.
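A minimal sketch of the parser points above, under stated assumptions: ListingEntry, ListingLineParser and LsrLineParser are hypothetical names, not classes from the patch. The input is assumed to be the usual 8-column "hdfs dfs -ls -R" line shape. java.util.logging at FINE stands in for SLF4J debug only so that the sketch has no dependencies. Each rejected line is logged with its reason, which a bare null return cannot provide.

```java
import java.util.Optional;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical entry type; the real code would build a Hadoop FileStatus. */
final class ListingEntry {
    final String path;
    final boolean isDirectory;
    final long length;

    ListingEntry(String path, boolean isDirectory, long length) {
        this.path = path;
        this.isDirectory = isDirectory;
        this.length = length;
    }
}

/** Hypothetical plug-in point: one implementation per listing format. */
interface ListingLineParser {
    /** Returns the parsed entry, or empty if the line cannot be parsed. */
    Optional<ListingEntry> parse(String line);
}

/** Parses lines assumed to look like "hdfs dfs -ls -R" output (8 columns). */
final class LsrLineParser implements ListingLineParser {
    private static final Logger LOG =
        Logger.getLogger(LsrLineParser.class.getName());

    @Override
    public Optional<ListingEntry> parse(String line) {
        // Limit the split to 8 so a path containing spaces stays in one piece.
        String[] cols = line.trim().split("\\s+", 8);
        if (cols.length < 8) {
            // Say why the line was rejected instead of silently dropping it.
            LOG.fine(() -> "Skipping line: expected 8 columns, found "
                + cols.length + ": " + line);
            return Optional.empty();
        }
        try {
            long length = Long.parseLong(cols[4]);
            boolean isDirectory = cols[0].startsWith("d");
            return Optional.of(new ListingEntry(cols[7], isDirectory, length));
        } catch (NumberFormatException e) {
            LOG.log(Level.FINE, e, () -> "Skipping line: bad file size '"
                + cols[4] + "': " + line);
            return Optional.empty();
        }
    }
}
```

An fsimage-based parser would be a second ListingLineParser implementation, so the copy listing code would only depend on the interface.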
Issues:
# What to do if there are errors parsing the existing file: ignore or fail?
"Ignore" could hide a serious failure in a backup process.
# What to do if a listed file isn't there, or has changed from a file to a
directory?
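One way to frame the first question is to make the choice explicit rather than implicit. The sketch below is only an illustration: ParseErrorPolicy, ListingLoadResult and ListingLoader are hypothetical names, and the parser is assumed to return null on failure, as parseFsImageLineToFileStatus is described as doing. FAIL aborts on the first bad line; SKIP keeps going but reports how many lines were dropped, so a caller can still refuse to run a backup from a badly damaged listing.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

/** Hypothetical switch: abort on the first bad line, or skip and count. */
enum ParseErrorPolicy { FAIL, SKIP }

/** Hypothetical result: the parsed entries plus the number of skipped lines. */
final class ListingLoadResult<T> {
    final List<T> entries;
    final int skipped;

    ListingLoadResult(List<T> entries, int skipped) {
        this.entries = Collections.unmodifiableList(entries);
        this.skipped = skipped;
    }
}

final class ListingLoader {
    private ListingLoader() {
    }

    /**
     * Parses every line with the given parser, which is assumed to return
     * null when a line cannot be parsed.
     */
    static <T> ListingLoadResult<T> load(List<String> lines,
                                         Function<String, T> parser,
                                         ParseErrorPolicy policy) {
        List<T> entries = new ArrayList<>();
        int skipped = 0;
        for (int i = 0; i < lines.size(); i++) {
            T entry = parser.apply(lines.get(i));
            if (entry != null) {
                entries.add(entry);
            } else if (policy == ParseErrorPolicy.FAIL) {
                // Report the 1-based line number so the bad input can be found.
                throw new IllegalStateException(
                    "Cannot parse listing line " + (i + 1) + ": " + lines.get(i));
            } else {
                skipped++;
            }
        }
        return new ListingLoadResult<>(entries, skipped);
    }
}
```

With SKIP, the caller can compare skipped against a threshold of its own choosing before starting the copy, which keeps "ignore" from silently hiding a mostly unreadable listing.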
> Faster distcp by taking file list from fsimage or -lsr result
> -------------------------------------------------------------
>
> Key: HADOOP-14137
> URL: https://issues.apache.org/jira/browse/HADOOP-14137
> Project: Hadoop Common
> Issue Type: New Feature
> Components: tools/distcp
> Affects Versions: 2.6.5
> Reporter: Zheng Shao
> Assignee: Zheng Shao
> Fix For: 2.6.6
>
> Attachments: HADOOP-14137.branch26.1.patch,
> HADOOP-14137.branch26.2.patch, HADOOP-14137.branch26.3.patch
>
>
> DistCp is very slow to start when the src directory has a huge number of
> subdirectories. In our case, we already have the directory listing (via
> "hdfs oiv -i fsimage" or via nightly "hdfs dfs -ls -r /" dumps), and we would
> like to use that instead of doing realtime listing on the NameNode.
> The "-f" option doesn't help in this case because it would try to put
> everything into a single flat target directory.
> We'd like to introduce a new option "-listing <file>" for distcp. The <file>
> contains the result of listing the src directory.
> In order to achieve this, we plan to:
> 1. Add a new CopyListing class PregeneratedCopyListing similar to
> SimpleCopyListing which doesn't "-ls -r" into the directory, but takes the
> listing via "-list"
> 2. Add an option "-list <file>" which will automatically make distcp use the
> new PregeneratedCopyListing class.