distcp split generation does not work correctly
-----------------------------------------------

                 Key: HADOOP-2032
                 URL: https://issues.apache.org/jira/browse/HADOOP-2032
             Project: Hadoop
          Issue Type: Bug
          Components: util
            Reporter: Runping Qi



With the current implementation, distcp always assigns multiple files to
one mapper, no matter how large the files are. The CopyFiles class uses a
sequence file to store the list of files to be copied, one record per file,
and it correctly generates one split per record in that sequence file.
However, because of the way the sequence file record reader works, the
minimum unit for a split is the segment between two "sync marks" in the
sequence file.
The result is strange behavior: some mappers get zero records (zero files
to copy) even though their split lengths are non-zero, while other mappers
get multiple records (multiple files to copy) from their split and beyond
it, up to the next sync mark.

When the CopyFiles class creates the sequence file, it does try to place a
sync mark between splittable segments by calling the sync() function of
the sequence file record writer.
Unfortunately, the sync() function is a no-op for files that are not block
compressed.
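
The effect described above can be illustrated with a small stand-alone
model. This is only a sketch under simplifying assumptions, not Hadoop's
actual record reader: records have a fixed, hypothetical size, sync marks
are modelled as zero-width offsets, offset 0 (the file header) counts as a
sync point, and the class and method names are made up for illustration.
Each split covers exactly one record; the model reader skips to the first
sync mark at or after the split start and keeps reading until it meets a
sync mark at or beyond the split end.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class SyncSplitModel {
    // Hypothetical fixed record size in bytes (real records vary in size).
    static final int RECORD_SIZE = 100;

    /** Returns how many records each one-record-wide split yields in this model. */
    static List<Integer> recordsPerSplit(int numRecords, TreeSet<Integer> syncOffsets) {
        int fileLength = numRecords * RECORD_SIZE;
        List<Integer> counts = new ArrayList<>();
        for (int i = 0; i < numRecords; i++) {
            int start = i * RECORD_SIZE;
            int end = start + RECORD_SIZE;
            // The reader first skips to the first sync mark at or after the split start.
            Integer first = syncOffsets.ceiling(start);
            if (first == null || first >= end) {
                counts.add(0); // no sync mark inside the split: nothing to read
                continue;
            }
            // It then reads until a sync mark at or beyond the split end, or end of file.
            Integer stop = syncOffsets.ceiling(end);
            int limit = (stop == null) ? fileLength : stop;
            counts.add((limit - first) / RECORD_SIZE);
        }
        return counts;
    }

    public static void main(String[] args) {
        int numRecords = 6;
        // Sync marks only at the header and before the fourth record.
        TreeSet<Integer> sparse = new TreeSet<>(Arrays.asList(0, 300));
        // A sync mark in front of every record, which is what CopyFiles intends.
        TreeSet<Integer> dense = new TreeSet<>();
        for (int i = 0; i < numRecords; i++) {
            dense.add(i * RECORD_SIZE);
        }
        System.out.println("sparse sync marks:    " + recordsPerSplit(numRecords, sparse));
        System.out.println("sync mark per record: " + recordsPerSplit(numRecords, dense));
    }
}
```

In this model, sparse sync marks give per-split record counts of
[3, 0, 0, 3, 0, 0], while a sync mark in front of every record gives
[1, 1, 1, 1, 1, 1], i.e. one file per mapper.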

As expected, after I changed the compression type of the sequence file to
block compression, mappers got the correct records from their splits.
So a simple fix is to change the compression type to CompressionType.BLOCK:

{code}
// create src list; BLOCK compression so that writer.sync() is not a no-op
SequenceFile.Writer writer = SequenceFile.createWriter(
    jobDirectory.getFileSystem(jobConf), jobConf, srcfilelist,
    LongWritable.class, FilePair.class,
    SequenceFile.CompressionType.BLOCK);
{code}



