[ https://issues.apache.org/jira/browse/MAPREDUCE-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783023#action_12783023 ]
Hemanth Yamijala commented on MAPREDUCE-1231:
---------------------------------------------
I looked at the Yahoo! Hadoop 0.20 patch. One minor nit: the name of the
internal config option differs between this patch and the trunk patch. In the
trunk patch, the option is distcp.skip.crc.check; in the internal patch, it is
distcp.skip.crc. Since this is a jobconf option, it would be better to keep the
two names in sync. At the very least, that avoids confusion when Hadoop is
upgraded to the trunk version.
Other than this, the 0.20 patch looks good.
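To illustrate the naming concern, here is a minimal sketch of how a caller
could tolerate both option names during an upgrade. It is not taken from
either patch: the jobconf is modelled as a plain Map, the helper skipCrc() is
hypothetical, and the default of "do not skip" is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

public class SkipCrcOption {
    // Option name used in the trunk patch, as quoted in this comment.
    static final String TRUNK_KEY = "distcp.skip.crc.check";
    // Option name used in the internal 0.20 patch, as quoted in this comment.
    static final String INTERNAL_KEY = "distcp.skip.crc";

    // Hypothetical helper: prefer the trunk name, fall back to the internal
    // name, and assume CRC checks stay enabled when neither is set.
    static boolean skipCrc(Map<String, String> conf) {
        String value = conf.get(TRUNK_KEY);
        if (value == null) {
            value = conf.get(INTERNAL_KEY);
        }
        // Boolean.parseBoolean(null) is false, so an unset option means
        // "do not skip".
        return Boolean.parseBoolean(value);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(INTERNAL_KEY, "true"); // a job still using the 0.20 name
        System.out.println("skip CRC: " + skipCrc(conf));
    }
}
```

Keeping a single name in both patches would make a fallback like this
unnecessary, which is the point of the nit.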
Another point (unrelated to this JIRA): the post-copy validation appears to be
done differently in trunk and in 0.20. In trunk, it is done by a call to the
sameFile() API, so it includes CRC checks by default. In the internal 0.20
patch, the check compares only file lengths, irrespective of the option to skip
CRC checks. It is unclear whether this is by design. At any rate, this
inconsistency is not related to this patch.
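The difference between the two validation styles can be sketched as follows.
This is only an illustration, not the code from either patch: byte arrays stand
in for files, java.util.zip.CRC32 stands in for the file system checksum, and
looksSame() is a hypothetical name.

```java
import java.util.zip.CRC32;

public class PostCopyCheck {
    // Compute a CRC32 over the whole buffer (stand-in for a file checksum).
    static long crcOf(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    // Hypothetical validation: always compare lengths, and compare checksums
    // as well unless the caller asked to skip them.
    static boolean looksSame(byte[] src, byte[] dst, boolean skipCrc) {
        if (src.length != dst.length) {
            return false;
        }
        if (skipCrc) {
            return true; // length-only check
        }
        return crcOf(src) == crcOf(dst);
    }

    public static void main(String[] args) {
        byte[] src = "abcd".getBytes();
        byte[] dst = "abce".getBytes(); // same length, different content
        System.out.println("length only:  " + looksSame(src, dst, true));
        System.out.println("length + CRC: " + looksSame(src, dst, false));
    }
}
```

A length-only check accepts a destination whose content differs but whose size
matches, which is why the behaviour is worth confirming as intentional.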
> Distcp is very slow
> -------------------
>
> Key: MAPREDUCE-1231
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1231
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: distcp
> Affects Versions: 0.20.1
> Reporter: Jothi Padmanabhan
> Assignee: Jothi Padmanabhan
> Attachments: mapred-1231-v1.patch, mapred-1231-v2.patch,
> mapred-1231-v3.patch, mapred-1231-v3.patch, mapred-1231-y20-v2.patch,
> mapred-1231-y20-v3.patch, mapred-1231-y20.patch, mapred-1231.patch
>
>
> Currently distcp performs a checksum check in addition to a file length check
> to decide whether a remote file has to be copied. If the number of files is
> high (thousands), this checksum check proves fairly costly, leading to a long
> delay before the copy starts.