[ 
https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067064#comment-15067064
 ] 

Yongjun Zhang commented on HADOOP-11794:
----------------------------------------

Thanks [~mithun]!

Not sure about {{CombineFileInputFormat}}, but I will take a look.

{quote}
Hmm... Do we? DistCp copies whole files (even if at a split level). Since we 
can retrieve located blocks for all blocks in the file, shouldn't that be 
enough? We could group locatedBlocks by block-id. Perhaps I'm missing something.
{quote}

Sorry, I was not clear. This jira is to avoid copying a single large file within 
one mapper. What I have in mind is to break a large file into block ranges 
(controlled by a new distcp command-line arg), such as (0, 10), (10, 10), ... 
(100, 4), where each entry is a pair (starting block index, number of blocks), 
and all entries for the same file except the last have the same number of 
blocks. We could then assign the entries of the same file to different mappers 
so they work in parallel. To do this, we can use the API I described to fetch 
block locations for just the given block range. My argument is that fetching 
all block locations for a file is not as efficient as fetching only the block 
range the mapper is assigned to work on.
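To illustrate the splitting scheme above, here is a minimal sketch (the class and method names are hypothetical, not actual distcp code) of how a file's block count could be divided into (starting block index, number of blocks) entries, where every entry except possibly the last covers the same number of blocks:

```java
import java.util.ArrayList;
import java.util.List;

public class BlockRanges {
    // Hypothetical helper: split a file's blocks into entries of
    // (starting block index, number of blocks). All entries except
    // possibly the last cover blocksPerChunk blocks.
    static List<long[]> split(long totalBlocks, long blocksPerChunk) {
        List<long[]> ranges = new ArrayList<>();
        for (long start = 0; start < totalBlocks; start += blocksPerChunk) {
            long count = Math.min(blocksPerChunk, totalBlocks - start);
            ranges.add(new long[] { start, count });
        }
        return ranges;
    }
}
```

For example, a 104-block file split 10 blocks at a time yields the entries (0, 10), (10, 10), ..., (100, 4); each entry could then be handed to a separate mapper.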

Based on my explanation here, do you agree that the API would help? I have done 
a prototype of the API that fetches block locations for a block range, and will 
try to post it after the holiday. I think other applications may need this kind 
of API too.

Thanks.



> distcp can copy blocks in parallel
> ----------------------------------
>
>                 Key: HADOOP-11794
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11794
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>    Affects Versions: 0.21.0
>            Reporter: dhruba borthakur
>            Assignee: Yongjun Zhang
>         Attachments: MAPREDUCE-2257.patch
>
>
> The minimum unit of work for a distcp task is a file. We have files that are 
> greater than 1 TB with a block size of 1 GB. If we use distcp to copy these 
> files, the tasks either take an extremely long time or eventually fail. A 
> better way for distcp would be to copy all the source blocks in parallel, and 
> then stitch the blocks back into files at the destination via the HDFS Concat 
> API (HDFS-222)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)