[
https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15055099#comment-15055099
]
Yongjun Zhang commented on HADOOP-11794:
----------------------------------------
Hi [~mithun],
Some more thoughts to share.
When I commented earlier about "include <offset, chunkLength> as two new
members of class CopyListingFileStatus", I was thinking of the offset and
chunkLength at the byte level. Inspired by your suggestion "You'll need to
create a FileSplit per file-block", I now think we can express them in blocks
instead.
That is, we can split the file into chunks, where each chunk contains multiple
blocks. A chunk is represented as a block range <bgnIdx, numBlocks>, where
bgnIdx is the block index of the first block in the chunk, and numBlocks is
the number of blocks in the chunk. A degenerate case is what you suggested:
one file-block per split. But I'm making it more flexible here, so that we
can support a variable number of blocks per split.
I'd make the number of blocks per split a distcp parameter. For a given
distcp run, the number of blocks in a split is fixed at the value specified by
the parameter, except for the last split of a file, which might contain fewer
blocks. BTW, because of the "append" feature, a single file may contain blocks
of different sizes, so it's not always true that each split will be the same
size in bytes.
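The chunking scheme above can be sketched roughly as follows. This is just an illustrative sketch, not actual distcp code: the class and method names (ChunkPlanner, BlockRange, planChunks) are made up for the example, and it works on block counts only, ignoring the variable block sizes mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkPlanner {
    /** A block range: bgnIdx is the index of the first block, numBlocks the count. */
    static final class BlockRange {
        final int bgnIdx;
        final int numBlocks;
        BlockRange(int bgnIdx, int numBlocks) {
            this.bgnIdx = bgnIdx;
            this.numBlocks = numBlocks;
        }
        @Override public String toString() {
            return "<" + bgnIdx + ", " + numBlocks + ">";
        }
    }

    /**
     * Split a file of totalBlocks blocks into chunks of at most
     * blocksPerSplit blocks each; only the last chunk may be smaller.
     */
    static List<BlockRange> planChunks(int totalBlocks, int blocksPerSplit) {
        List<BlockRange> chunks = new ArrayList<>();
        for (int bgnIdx = 0; bgnIdx < totalBlocks; bgnIdx += blocksPerSplit) {
            chunks.add(new BlockRange(bgnIdx,
                    Math.min(blocksPerSplit, totalBlocks - bgnIdx)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // A 10-block file with 4 blocks per split: the last chunk holds only 2.
        System.out.println(planChunks(10, 4));  // prints [<0, 4>, <4, 4>, <8, 2>]
    }
}
```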
We need a new client-namenode API in the protocol to get back the
LocatedBlocks for a specified block range, so the CopyMapper can work on the
given block range (possibly other applications will need a similar API too). I
will create a jira about it.
BTW, I had quite some fun with distcp, but I did not know who the author of
distcp v2 was until working on this jira. I appreciate your excellent work!
Thanks.
> distcp can copy blocks in parallel
> ----------------------------------
>
> Key: HADOOP-11794
> URL: https://issues.apache.org/jira/browse/HADOOP-11794
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Affects Versions: 0.21.0
> Reporter: dhruba borthakur
> Assignee: Yongjun Zhang
> Attachments: MAPREDUCE-2257.patch
>
>
> The minimum unit of work for a distcp task is a file. We have files that are
> greater than 1 TB with a block size of 1 GB. If we use distcp to copy these
> files, the tasks either take a very long time or eventually fail. A better
> way for distcp would be to copy all the source blocks in parallel, and then
> stitch the blocks back into files at the destination via the HDFS Concat API
> (HDFS-222)
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)