[
https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15044373#comment-15044373
]
Mithun Radhakrishnan commented on HADOOP-11794:
-----------------------------------------------
Sorry, no. That's likely [~dhruba]'s work, which might have been based on the
DistCp-v1 code. We'll need new code for the DistCp-v2 code (i.e. my rewrite
from MAPREDUCE-2765).
Apologies if you've already thought this through. One would need to change the
{{DynamicInputFormat#createSplits()}} implementation, which currently looks
thus:
{code:java:borderStyle=solid:title=DynamicInputFormat.java}
private List<InputSplit> createSplits(JobContext jobContext,
                                      List<DynamicInputChunk> chunks)
        throws IOException {
  int numMaps = getNumMapTasks(jobContext.getConfiguration());

  final int nSplits = Math.min(numMaps, chunks.size());
  List<InputSplit> splits = new ArrayList<InputSplit>(nSplits);

  for (int i = 0; i < nSplits; ++i) {
    TaskID taskId = new TaskID(jobContext.getJobID(), TaskType.MAP, i);
    chunks.get(i).assignTo(taskId);
    splits.add(new FileSplit(chunks.get(i).getPath(), 0,
        // Setting non-zero length for FileSplit size, to avoid a possible
        // future when 0-sized file-splits are considered "empty" and skipped
        // over.
        getMinRecordsPerChunk(jobContext.getConfiguration()),
        null));
  }
  DistCpUtils.publish(jobContext.getConfiguration(),
                      CONF_LABEL_NUM_SPLITS, splits.size());
  return splits;
}
{code}
You'll need to create a {{FileSplit}} per file-block (by first examining the
file's block-size). The mappers will now need to emit something like
{{(relativePathForOriginalSourceFile, targetLocation_with_block_number)}}. By
keying on the relative-source-paths (+ expected number of blocks), you can get
all the target-block-locations to hit the same reducer, where you can stitch
them together.
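For the per-block splits, a minimal sketch of the offset arithmetic (standalone, outside the Hadoop API; {{BlockRange}} and {{blockRangesFor}} are hypothetical names, not anything in DistCp) might look like this:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch only: computes the (offset, length) ranges for one file,
 * one range per file-block. In real DistCp-v2 code each range would
 * back a FileSplit(path, offset, length, null); the file length and
 * block-size would come from the source FileStatus.
 */
public class BlockRanges {
  public static final class BlockRange {
    public final long offset;
    public final long length;
    public BlockRange(long offset, long length) {
      this.offset = offset;
      this.length = length;
    }
  }

  public static List<BlockRange> blockRangesFor(long fileLength, long blockSize) {
    List<BlockRange> ranges = new ArrayList<BlockRange>();
    for (long offset = 0; offset < fileLength; offset += blockSize) {
      // The last block may be shorter than the configured block-size.
      long length = Math.min(blockSize, fileLength - offset);
      ranges.add(new BlockRange(offset, length));
    }
    return ranges;
  }
}
```

Each mapper would then copy one such range to a block-numbered temporary file at the target; the reduce side, keyed on the relative source path (plus the expected block count), sees all of a file's block locations and can concatenate them in block order.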
Good luck. :]
> distcp can copy blocks in parallel
> ----------------------------------
>
> Key: HADOOP-11794
> URL: https://issues.apache.org/jira/browse/HADOOP-11794
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Affects Versions: 0.21.0
> Reporter: dhruba borthakur
> Assignee: Yongjun Zhang
> Attachments: MAPREDUCE-2257.patch
>
>
> The minimum unit of work for a distcp task is a file. We have files that are
> greater than 1 TB with a block size of 1 GB. If we use distcp to copy these
> files, the tasks either take a long, long time or eventually fail. A better
> way for distcp would be to copy all the source blocks in parallel, and then
> stitch the blocks back into files at the destination via the HDFS concat API
> (HDFS-222).