[jira] Commented: (MAPREDUCE-2117) Superfast Distcp when copying data within the same hdfs cluster

dhruba borthakur (JIRA) Thu, 07 Oct 2010 22:42:57 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12919160#action_12919160
 ]


dhruba borthakur commented on MAPREDUCE-2117:
---------------------------------------------

The proposal is a new InputFormat for distcp in which each split comprises one 
block of the input data set. The split is likely to be scheduled on the 
datanode where the block resides. The map task will copy that block to a tmp 
file in HDFS. The name of the original file and the block offset will be 
encoded in the name of the tmp file. The tmp file will have a replication 
factor of 1, so the data copy is all local to the datanode. Once all the maps 
finish, the distcp client will stitch together all the tmp files into the 
correct destination file via the DistributedFileSystem.concat() call.
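
The tmp-file naming and the final stitching order described above can be 
sketched as follows. This is only an illustration in plain Python, not Hadoop 
code: the separator, the zero-padded offset format, and the helper names 
(tmp_name, parse_tmp_name, concat_plan) are all assumptions made for the 
example. The comment only says that the file name and block offset are encoded 
in the tmp file name, not how.

```python
from collections import defaultdict
from urllib.parse import quote, unquote

# Hypothetical separator between the encoded source path and the block offset.
SEP = "__off_"


def tmp_name(src_path, offset):
    """Encode the source path and block offset into a flat tmp file name."""
    # Percent-encode every reserved character (including "/") so the result
    # is a single path component; zero-pad the offset to a fixed width.
    return f"{quote(src_path, safe='')}{SEP}{offset:020d}"


def parse_tmp_name(name):
    """Recover (source path, block offset) from a tmp file name."""
    # The offset is digits only, so the last separator is always ours.
    encoded, _, offset = name.rpartition(SEP)
    return unquote(encoded), int(offset)


def concat_plan(tmp_names):
    """Group tmp files by source file and order each group by block offset."""
    by_file = defaultdict(list)
    for name in tmp_names:
        src, offset = parse_tmp_name(name)
        by_file[src].append((offset, name))
    # Each value is the ordered list of parts a client would pass to concat.
    return {src: [n for _, n in sorted(parts)] for src, parts in by_file.items()}
```

A client following this scheme would list the tmp directory once the maps 
finish, build the plan, and then issue one concat per destination file with 
the parts in the order given.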

For performance reasons, we can implement a DistributedFileSystem.concatBulk() 
that makes a single RPC to create multiple concatenated files.

> Superfast Distcp when copying data within the same hdfs cluster
> ---------------------------------------------------------------
>
>                 Key: MAPREDUCE-2117
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2117
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: distcp
>            Reporter: dhruba borthakur
>
> There are use cases where distcp is used to copy a bunch of files/directories 
> from one part of the HDFS namespace to another part within the same HDFS 
> cluster. The copy would be superfast if we could instruct the relevant 
> datanodes to make local replicas of the relevant blocks and limit network 
> usage to a minimum. This is especially useful for letting HBase take a backup 
> of a region with minimum downtime. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
