[
https://issues.apache.org/jira/browse/MAPREDUCE-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079165#comment-13079165
]
Mithun Radhakrishnan commented on MAPREDUCE-2149:
-------------------------------------------------
https://issues.apache.org/jira/browse/MAPREDUCE-2765
This rewrite does attempt to address setup times (as well as copy performance).
> Distcp : setup with update is too slow when latency is high
> -----------------------------------------------------------
>
> Key: MAPREDUCE-2149
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2149
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: distcp
> Affects Versions: 0.20.2, 0.21.0
> Reporter: Raghu Angadi
> Assignee: Raghu Angadi
> Attachments: MAPREDUCE-2149.patch
>
>
> If you run distcp with the '-update' option, then for _each of the files_
> present on the source cluster, setup invokes a separate RPC to the
> destination cluster to fetch file info.
> Usually this overhead is not very noticeable when both clusters are
> geographically close to each other. But if the latency is large, setup could
> take a couple of orders of magnitude longer.
> E.g.: the source has 10k directories, each with about 10 files, and the
> round-trip latency between source and destination is 75 ms (typical for
> coast-to-coast clusters).
> If we run distcp on the source cluster, setup would take about _2.5 hours_
> irrespective of whether the destination has these files or not. '-lsr' on the
> same dest dir from the source cluster would take up to 12 min (depending on
> how many directories already exist on dest).
> * A fairly simple fix to how setup() iterates should bring the setup time
> down to the same as '-lsr'. I will have a patch for this (though 12 min is
> still too large).
> * A more scalable option is to defer the update check to the mappers.
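
The estimates quoted above can be roughly reproduced with a back-of-the-envelope model. This is only a sketch, not Hadoop code: it assumes every remote call is issued sequentially and costs exactly one round trip, that the per-file check also touches each directory once, and that a recursive listing needs one call per directory. The function names are hypothetical.

```python
# Back-of-the-envelope model of distcp setup time; not Hadoop code.
# Assumption: calls are sequential and each costs exactly one round trip.

def setup_seconds(num_rpcs: int, rtt_ms: float) -> float:
    """Total time spent waiting on sequential RPCs, in seconds."""
    return num_rpcs * rtt_ms / 1000.0

def per_file_rpcs(num_dirs: int, files_per_dir: int) -> int:
    """One file-info call per directory and per file (assumed behaviour)."""
    return num_dirs + num_dirs * files_per_dir

def per_directory_rpcs(num_dirs: int) -> int:
    """One listing call per directory, as a recursive listing would issue."""
    return num_dirs

RTT_MS = 75            # coast-to-coast round trip from the description
DIRS = 10_000          # 10k directories
FILES_PER_DIR = 10     # about 10 files each

slow = setup_seconds(per_file_rpcs(DIRS, FILES_PER_DIR), RTT_MS)
fast = setup_seconds(per_directory_rpcs(DIRS), RTT_MS)

print(f"per-file checks:    {slow / 3600:.1f} hours")
print(f"per-directory list: {fast / 60:.1f} minutes")
```

Under these assumptions the per-file path comes to about 2.3 hours and the per-directory path to 12.5 minutes, in the same range as the figures in the description. The gap is a factor of about (1 + files per directory), which is why iterating by directory helps and why pushing the check to the mappers, where it can run in parallel, scales further.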