[
https://issues.apache.org/jira/browse/MAPREDUCE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12919160#action_12919160
]
dhruba borthakur commented on MAPREDUCE-2117:
---------------------------------------------
Proposal: a new InputFormat for distcp in which each split consists of one block
of the input data set. The split is likely to be scheduled on the datanode where
the block resides. The map task copies that block to a tmp file in HDFS. The
name of the original file and the block offset are encoded in the name of
the tmp file. The tmp file has a replication factor of 1, so the data
copy is entirely local to the datanode. Once all the maps finish, the distcp
client stitches all the tmp files together into the correct destination file via
the DistributedFileSystem.concat() call.
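
The bookkeeping in this scheme (one split per block, tmp file names that
carry the source path and block offset, and an ordered list of parts for the
final concat) could look roughly like the Python sketch below. It is only an
illustration under stated assumptions: the function names, the "@" separator
and the zero-padded offset are made up here and are not part of distcp or
HDFS, and no real copy or concat is performed.

```python
from collections import defaultdict
from urllib.parse import quote, unquote


def block_splits(src, length, block_size):
    """Return one (src, offset, size) split per block of the source file."""
    return [(src, off, min(block_size, length - off))
            for off in range(0, length, block_size)]


def tmp_name(src, offset):
    """Encode the source path and block offset into one tmp file name."""
    # quote(..., safe="") escapes "/" and "@", so "@" is an unambiguous
    # separator between the escaped path and the offset (assumed format).
    return f"{quote(src, safe='')}@{offset:020d}"


def parse_tmp_name(name):
    """Recover (source path, block offset) from a tmp file name."""
    quoted, offset = name.rsplit("@", 1)
    return unquote(quoted), int(offset)


def concat_plan(tmp_names):
    """Group tmp files by source path, each group ordered by block offset."""
    by_src = defaultdict(list)
    for name in tmp_names:
        src, offset = parse_tmp_name(name)
        by_src[src].append((offset, name))
    return {src: [n for _, n in sorted(parts)]
            for src, parts in by_src.items()}
```

The client would then issue one concat per entry in the plan, passing the
parts in the listed order. Mapping each source path to its destination path
is left out of the sketch.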
For performance reasons, we can implement a DistributedFileSystem.concatBulk()
call that makes a single RPC to create multiple concatenated files.
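
To make the RPC saving concrete, here is a small in-memory model in Python.
Everything in it is hypothetical: FakeNameNode is a stand-in that counts
calls rather than talking to a namenode, concat_bulk only mirrors the
proposed concatBulk() idea, and, as a simplification, the target file is
created if it does not already exist.

```python
class FakeNameNode:
    """In-memory stand-in for the namenode; counts RPCs instead of sending them."""

    def __init__(self, files):
        self.files = dict(files)  # path -> file contents (bytes)
        self.rpc_count = 0

    def _concat(self, target, sources):
        # Append each source to the target in order, then remove the source.
        data = self.files.get(target, b"")
        for src in sources:
            data += self.files.pop(src)
        self.files[target] = data

    def concat(self, target, sources):
        """One RPC per destination file."""
        self.rpc_count += 1
        self._concat(target, sources)

    def concat_bulk(self, plan):
        """Hypothetical bulk call: one RPC covers every destination in plan."""
        self.rpc_count += 1
        for target, sources in plan.items():
            self._concat(target, sources)
```

With a plan of N destination files, calling concat() in a loop costs N RPCs
in this model, while concat_bulk() costs one and leaves the same files
behind.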
> Superfast Distcp when copying data within the same hdfs cluster
> ---------------------------------------------------------------
>
> Key: MAPREDUCE-2117
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2117
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: distcp
> Reporter: dhruba borthakur
>
> There are use cases where distcp is used to copy a set of files/directories
> from one part of the HDFS namespace to another part within the same HDFS
> cluster. The copy would be much faster if we could instruct the relevant
> datanodes to make local replicas of the relevant blocks, limiting network
> usage to a minimum. This is especially useful for letting HBase back up a
> region with minimal downtime.