[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783023#action_12783023
 ] 

Hemanth Yamijala commented on MAPREDUCE-1231:
---------------------------------------------

I looked at the Yahoo! Hadoop 0.20 patch. One minor nit: the config option 
name differs between the internal patch and the trunk patch. In the trunk 
patch, the option is distcp.skip.crc.check; in the internal patch it is 
distcp.skip.crc. Since this is a jobconf option, it would be better to keep the 
two in sync. At the very least, that avoids confusion when Hadoop is upgraded 
to the trunk version.
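
To illustrate the kind of confusion a name mismatch can cause, here is a 
minimal sketch of a lookup that tolerates both spellings. It is not taken from 
either patch: the class SkipCrcOption and the method shouldSkipCrc are 
hypothetical, and a plain Map stands in for the job configuration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper; a plain Map stands in for the job configuration. */
class SkipCrcOption {
    // The two option names mentioned in the comment above.
    static final String TRUNK_KEY = "distcp.skip.crc.check";
    static final String INTERNAL_KEY = "distcp.skip.crc";

    /**
     * Reads the trunk name first and falls back to the internal 0.20 name
     * only when the trunk name is unset. Defaults to false.
     */
    static boolean shouldSkipCrc(Map<String, String> conf) {
        String value = conf.get(TRUNK_KEY);
        if (value == null) {
            value = conf.get(INTERNAL_KEY);
        }
        return Boolean.parseBoolean(value); // parseBoolean(null) is false
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(INTERNAL_KEY, "true"); // a job still using the 0.20 name
        System.out.println(shouldSkipCrc(conf));
    }
}
```

Using a single name in both patches would make a fallback like this unnecessary.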

Other than this, the 20 patch looks good.

Another point (unrelated to this JIRA) is that post-copy validation seems to 
be done differently in trunk and 20. In trunk, it is done by a call to 
sameFile(), so it includes CRC checks by default. In the internal 20 patch, 
the check compares only file lengths, irrespective of the option to skip CRC 
checks. It is unclear whether this is by design. At any rate, this 
inconsistency is not related to this patch.
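
The difference between the two validation styles can be sketched roughly as 
follows. This is an illustration only, not the DistCp code: the class 
PostCopyCheck is hypothetical, byte arrays stand in for files, and 
java.util.zip.CRC32 stands in for whatever checksum the file system provides.

```java
import java.util.zip.CRC32;

/** Hypothetical stand-in for post-copy validation; not the DistCp code. */
class PostCopyCheck {
    /** CRC32 of the whole buffer. */
    static long crcOf(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    /**
     * Length is always compared. The checksum is compared only when
     * skipCrc is false, so skipCrc=true reduces to a length-only check.
     */
    static boolean sameFile(byte[] src, byte[] dst, boolean skipCrc) {
        if (src.length != dst.length) {
            return false;
        }
        if (skipCrc) {
            return true;
        }
        return crcOf(src) == crcOf(dst);
    }
}
```

With skipCrc set to true, two files of equal length but different content pass 
the check, which is what a length-only comparison implies.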

> Distcp is very slow
> -------------------
>
>                 Key: MAPREDUCE-1231
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1231
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: distcp
>    Affects Versions: 0.20.1
>            Reporter: Jothi Padmanabhan
>            Assignee: Jothi Padmanabhan
>         Attachments: mapred-1231-v1.patch, mapred-1231-v2.patch, 
> mapred-1231-v3.patch, mapred-1231-v3.patch, mapred-1231-y20-v2.patch, 
> mapred-1231-y20-v3.patch, mapred-1231-y20.patch, mapred-1231.patch
>
>
> Currently distcp does a checksum check in addition to a file length check to 
> decide if a remote file has to be copied. If the number of files is high 
> (thousands), this checksum check proves fairly costly, leading to a long 
> delay before the copy is started.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
