[
https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15044373#comment-15044373
]
Mithun Radhakrishnan commented on HADOOP-11794:
-----------------------------------------------
Sorry, no. That's likely [~dhruba]'s work, which might have been based on the
DistCp-v1 code. We'll need new code for the DistCp-v2 code (i.e. my rewrite
from MAPREDUCE-2765).
Apologies if you've already thought this through. One would need to change the
{{DynamicInputFormat#createSplits()}} implementation, which currently looks
thus:
{code:java:borderStyle=solid:title=DynamicInputFormat.java}
private List<InputSplit> createSplits(JobContext jobContext,
                                      List<DynamicInputChunk> chunks)
        throws IOException {
  int numMaps = getNumMapTasks(jobContext.getConfiguration());

  final int nSplits = Math.min(numMaps, chunks.size());
  List<InputSplit> splits = new ArrayList<InputSplit>(nSplits);

  for (int i = 0; i < nSplits; ++i) {
    TaskID taskId = new TaskID(jobContext.getJobID(), TaskType.MAP, i);
    chunks.get(i).assignTo(taskId);
    splits.add(new FileSplit(chunks.get(i).getPath(), 0,
        // Setting non-zero length for FileSplit size, to avoid a possible
        // future when 0-sized file-splits are considered "empty" and skipped
        // over.
        getMinRecordsPerChunk(jobContext.getConfiguration()),
        null));
  }
  DistCpUtils.publish(jobContext.getConfiguration(),
                      CONF_LABEL_NUM_SPLITS, splits.size());
  return splits;
}
{code}
You'll need to create a {{FileSplit}} per file-block (by first examining the
file's block-size). The mappers will now need to emit something like
{{(relativePathForOriginalSourceFile, targetLocation_with_block_number)}}. By
keying on the relative-source-paths (+ expected number of blocks), you can get
all the target-block-locations to hit the same reducer, where you can stitch
them together.
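For the per-block splits, a minimal sketch of the offset arithmetic (standalone, outside the Hadoop API; {{BlockRange}} and {{blockRangesFor}} are hypothetical names, not anything in DistCp) might look like this:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch only: computes the (offset, length) ranges for one file,
 * one range per file-block. In real DistCp-v2 code each range would
 * back a FileSplit(path, offset, length, null); the file length and
 * block-size would come from the source FileStatus.
 */
public class BlockRanges {
  public static final class BlockRange {
    public final long offset;
    public final long length;
    public BlockRange(long offset, long length) {
      this.offset = offset;
      this.length = length;
    }
  }

  public static List<BlockRange> blockRangesFor(long fileLength, long blockSize) {
    List<BlockRange> ranges = new ArrayList<BlockRange>();
    for (long offset = 0; offset < fileLength; offset += blockSize) {
      // The last block may be shorter than the configured block-size.
      long length = Math.min(blockSize, fileLength - offset);
      ranges.add(new BlockRange(offset, length));
    }
    return ranges;
  }
}
```

Each mapper would then copy one such range to a block-numbered temporary file at the target; the reduce side, keyed on the relative source path (plus the expected block count), sees all of a file's block locations and can concatenate them in block order.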
Good luck. :]
> distcp can copy blocks in parallel
> ----------------------------------
>
> Key: HADOOP-11794
> URL: https://issues.apache.org/jira/browse/HADOOP-11794
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Affects Versions: 0.21.0
> Reporter: dhruba borthakur
> Assignee: Yongjun Zhang
> Attachments: MAPREDUCE-2257.patch
>
>
> The minimum unit of work for a distcp task is a file. We have files that are
> greater than 1 TB with a block size of 1 GB. If we use distcp to copy these
> files, the tasks either take a long, long time or eventually fail. A better
> way for distcp would be to copy all the source blocks in parallel, and then
> stitch the blocks back into files at the destination via the HDFS concat API
> (HDFS-222).