[ 
https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15055099#comment-15055099
 ] 

Yongjun Zhang commented on HADOOP-11794:
----------------------------------------

Hi [~mithun],

Some more thoughts to share.

When I commented earlier about "include <offset, chunkLength> as two new 
members of class CopyListingFileStatus", I was thinking of the offset and 
chunkLength at the byte level. Inspired by your suggestion "You'll need to 
create a FileSplit per file-block", I now think we can express them in terms 
of blocks instead.

That is, we can split the file into chunks, where each chunk contains multiple 
blocks. A chunk is represented as a block range <bgnIdx, numBlocks>, where 
bgnIdx is the block index of the first block of the chunk and numBlocks is 
the number of blocks in the chunk. A degenerate case is what you suggested: 
one file-block per split. But I'm making it more flexible here, so that we 
can support a variable number of blocks per split.

I'd make the number of blocks per split a distcp parameter. For a given 
distcp run, the number of blocks in a split is fixed as specified by the 
parameter, except for the last split of a file, which might contain fewer 
blocks. Note that, because of the "append" feature, a single file may contain 
blocks of different sizes, so it's not always true that each split will be the 
same size in bytes.
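As a rough sketch of the chunking described above (the class and field names 
here are hypothetical, not part of any patch; in the real change the range 
would presumably live on CopyListingFileStatus):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split a file's blocks into consecutive chunks of at
// most blocksPerChunk blocks each. A chunk is a block range <bgnIdx, numBlocks>.
public class BlockChunker {

    // Simple holder for a block range.
    public static final class BlockRange {
        public final int bgnIdx;    // block index of the first block in the chunk
        public final int numBlocks; // number of blocks in the chunk

        public BlockRange(int bgnIdx, int numBlocks) {
            this.bgnIdx = bgnIdx;
            this.numBlocks = numBlocks;
        }
    }

    // Every chunk has exactly blocksPerChunk blocks, except possibly the
    // last one, which may be shorter.
    public static List<BlockRange> split(int totalBlocks, int blocksPerChunk) {
        List<BlockRange> chunks = new ArrayList<>();
        for (int bgnIdx = 0; bgnIdx < totalBlocks; bgnIdx += blocksPerChunk) {
            int numBlocks = Math.min(blocksPerChunk, totalBlocks - bgnIdx);
            chunks.add(new BlockRange(bgnIdx, numBlocks));
        }
        return chunks;
    }
}
```

For example, a 10-block file with 4 blocks per split would yield the ranges 
<0, 4>, <4, 4>, <8, 2> — the last split carries the remainder.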

We need a new client-namenode protocol API to fetch the locatedBlocks for a 
specified block range, so that the CopyMapper can work on the given block 
range (other applications may need a similar API as well). I will create a 
jira for it.

BTW, I've had quite some fun with distcp, but I did not know who the author of 
distcp v2 was until working on this jira. I appreciate your excellent work!

Thanks.


> distcp can copy blocks in parallel
> ----------------------------------
>
>                 Key: HADOOP-11794
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11794
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>    Affects Versions: 0.21.0
>            Reporter: dhruba borthakur
>            Assignee: Yongjun Zhang
>         Attachments: MAPREDUCE-2257.patch
>
>
> The minimum unit of work for a distcp task is a file. We have files that are 
> greater than 1 TB with a block size of 1 GB. If we use distcp to copy these 
> files, the tasks either take a very long time or eventually fail. A better 
> way for distcp would be to copy all the source blocks in parallel, and then 
> stitch the blocks back into files at the destination via the HDFS Concat API 
> (HDFS-222)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
