[ 
https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067142#comment-15067142
 ] 

Mithun Radhakrishnan commented on HADOOP-11794:
-----------------------------------------------

bq. My argument is that fetching all block locations for a file is not as 
efficient as fetching only the block range the mapper is assigned to work on.

Thank you for explaining. Let me see if I can phrase my questions more clearly 
than before:

# Would it make sense to include the block-locations within the splits, at the 
time of split-calculation, instead of the block-ranges? If yes, then we can 
make do with the API we already have, by fetching locatedBlocks for all files 
and grouping them among the DistCp splits. (It is indeed possible that keeping 
ranges, and using your proposed API on the map-side, might be faster. But those 
map-side calls might also put more parallel load on the name-node, depending 
on the number of maps.)

# Naive question: Why do we need to identify locatedBlocks? Don't HDFS files 
have uniformly sized blocks (within a file)? As such, aren't the 
block-boundaries implicit (i.e. from {{blockId*blockSize}} to 
{{(blockId+1)*blockSize - 1}}, taking {{blockId}} as the block's zero-based 
index within the file)? Can't we simply copy that range of bytes into a new 
file (and stitch the new files in reduce)?
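
To make the arithmetic in the second question concrete, here is a minimal sketch of computing implicit block boundaries. It assumes the premise of the question holds, i.e. that every block in a file has the same size except possibly the last one. {{BlockRanges}}, {{Range}}, {{rangeOf}} and {{allRanges}} are hypothetical names for illustration, not DistCp or HDFS classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of DistCp or HDFS). */
final class BlockRanges {

    /** Inclusive byte range [start, end] within a file. */
    static final class Range {
        final long start;
        final long end;

        Range(long start, long end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Byte range of the block at the given zero-based index, ASSUMING
     * uniformly sized blocks. The last block is clamped to the file length.
     */
    static Range rangeOf(long blockIndex, long blockSize, long fileLength) {
        if (blockIndex < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("bad block index or block size");
        }
        long start = blockIndex * blockSize;
        if (start >= fileLength) {
            throw new IllegalArgumentException("block index is past end of file");
        }
        long end = Math.min(start + blockSize, fileLength) - 1;
        return new Range(start, end);
    }

    /** All block ranges of a file, in order, under the same assumption. */
    static List<Range> allRanges(long blockSize, long fileLength) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("block size must be positive");
        }
        List<Range> ranges = new ArrayList<>();
        for (long start = 0; start < fileLength; start += blockSize) {
            ranges.add(new Range(start, Math.min(start + blockSize, fileLength) - 1));
        }
        return ranges;
    }
}
```

For a 300-byte file with 128-byte blocks, {{rangeOf(2, 128, 300)}} gives the short last block, bytes 256 through 299. If blocks within a file can differ in size, this arithmetic no longer applies and the boundaries would have to come from the name-node.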

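As a rough illustration of the first question, grouping already-fetched block locations among splits at split-calculation time could look something like the sketch below. {{SplitGrouping}}, {{Block}} and {{group}} are hypothetical names; {{Block}} is only a stand-in for whatever the existing API returns per block, and the in-order packing policy is an assumption, not how DistCp computes splits.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch (not DistCp code). */
final class SplitGrouping {

    /** Stand-in for one located block: file, byte offset, length, hosts. */
    static final class Block {
        final String path;
        final long offset;
        final long length;
        final List<String> hosts;

        Block(String path, long offset, long length, List<String> hosts) {
            this.path = path;
            this.offset = offset;
            this.length = length;
            this.hosts = hosts;
        }
    }

    /**
     * Packs blocks, in the order given, into splits of at most targetBytes
     * each. A block larger than targetBytes gets a split of its own.
     */
    static List<List<Block>> group(List<Block> blocks, long targetBytes) {
        if (targetBytes <= 0) {
            throw new IllegalArgumentException("targetBytes must be positive");
        }
        List<List<Block>> splits = new ArrayList<>();
        List<Block> current = new ArrayList<>();
        long currentBytes = 0;
        for (Block b : blocks) {
            // Close the current split if adding this block would overflow it.
            if (!current.isEmpty() && currentBytes + b.length > targetBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(b);
            currentBytes += b.length;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }
}
```

Each split then carries its blocks' hosts with it, so the maps would not need to ask the name-node for locations again. The trade-off raised above still applies: all locations are fetched up front on the client side rather than per range on the map side.
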
> distcp can copy blocks in parallel
> ----------------------------------
>
>                 Key: HADOOP-11794
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11794
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>    Affects Versions: 0.21.0
>            Reporter: dhruba borthakur
>            Assignee: Yongjun Zhang
>         Attachments: MAPREDUCE-2257.patch
>
>
> The minimum unit of work for a distcp task is a file. We have files that are 
> greater than 1 TB with a block size of 1 GB. If we use distcp to copy these 
> files, the tasks either take a very long time or eventually fail. A better 
> way for distcp would be to copy all the source blocks in parallel, and then 
> stitch the blocks back to files at the destination via the HDFS Concat API 
> (HDFS-222).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
