[
https://issues.apache.org/jira/browse/MAPREDUCE-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782566#action_12782566
]
Aaron Kimball commented on MAPREDUCE-1231:
------------------------------------------
Arun: I agree that using the checksums should be the default if this is the
behavior that users have come to expect.
rsync does its transfer-list generation based on file length and mtime (and,
optionally, a checksum as well with {{-c}}). Checking only the length feels
risky to me. I think the comparison would be much more convincing if
{{sameFile()}} also included an mtime comparison. This shouldn't add
significant processing overhead, since the same {{getStatus()}} call returns
both length and mtime.
(For this to work right, though, you'd need to preserve the mtime on the
receiver OS, which is done with {{-pt}}. So maybe disabling the checksum check
should also print a message recommending {{-pt}} to the user?)
And if an mtime comparison were added, users should be able to turn it off as
well.
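To make the suggestion concrete, here is a minimal sketch of a length-plus-mtime
comparison. It is not DistCp's actual code: it uses plain java.nio.file rather
than the Hadoop FileSystem API so that it runs on its own, and the class name
SameFileSketch and the ignoreMtime flag are hypothetical names made up for
illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.nio.file.attribute.FileTime;

public class SameFileSketch {

    /**
     * Returns true if dst looks like an up-to-date copy of src, judged only by
     * length and (unless ignoreMtime is set) modification time.
     * No file contents are read, so this is a heuristic, not a checksum.
     */
    static boolean sameFile(Path src, Path dst, boolean ignoreMtime) throws IOException {
        if (!Files.exists(dst)) {
            return false; // nothing at the destination yet, so it must be copied
        }
        // One attribute read per file yields both the length and the mtime.
        BasicFileAttributes s = Files.readAttributes(src, BasicFileAttributes.class);
        BasicFileAttributes d = Files.readAttributes(dst, BasicFileAttributes.class);

        if (s.size() != d.size()) {
            return false; // different lengths: definitely not the same file
        }
        if (ignoreMtime) {
            return true; // user opted out of the mtime comparison
        }
        // Only meaningful if the copy preserved the source's mtime.
        return s.lastModifiedTime().toMillis() == d.lastModifiedTime().toMillis();
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("src", ".bin");
        Path dst = Files.createTempFile("dst", ".bin");
        Files.write(src, new byte[] {1, 2, 3});
        Files.write(dst, new byte[] {1, 2, 3});

        // Simulate a copy that preserved the modification time.
        FileTime t = Files.getLastModifiedTime(src);
        Files.setLastModifiedTime(dst, t);

        System.out.println("length + mtime: " + sameFile(src, dst, false));
        System.out.println("length only:    " + sameFile(src, dst, true));
    }
}
```

Note that two files with equal length and mtime but different contents still
compare as the same here, which is the residual risk a checksum would cover.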
> Distcp is very slow
> -------------------
>
> Key: MAPREDUCE-1231
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1231
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: distcp
> Affects Versions: 0.20.1
> Reporter: Jothi Padmanabhan
> Assignee: Jothi Padmanabhan
> Fix For: 0.20.2
>
> Attachments: mapred-1231-v1.patch, mapred-1231-v2.patch,
> mapred-1231-y20-v2.patch, mapred-1231-y20.patch, mapred-1231.patch
>
>
> Currently distcp does a checksum check in addition to a file length check to
> decide whether a remote file has to be copied. If the number of files is high
> (thousands), this checksum check proves fairly costly, leading to a long
> delay before the copy starts.