DistCp should double-check copy size when expectation is unmet
--------------------------------------------------------------

                 Key: MAPREDUCE-2161
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2161
             Project: Hadoop Map/Reduce
          Issue Type: Bug
            Reporter: Dmitriy V. Ryaboy


As a basic sanity check, DistCp verifies that the size of each file at the 
destination matches its size at the source.
When this check fails, DistCp logs something along the lines of 
java.io.IOException: File size not matched: copied 3451980786 bytes (3.2g) to 
tmpfile (=hdfs://dest.hdfs/dir/_distcp_tmp_7uxv32/2010/10/26/20/fille) but 
expected 3422552064 bytes (3.2g) from hdfs://source.hdfs/dir/file)

and attempts to retry. The expected file size is picked up during 
initialization. This expectation can be incorrect for at least two reasons: 
the file was still being written to when DistCp was started (which is a bug 
in and of itself), or the file was replaced at the source between the time 
the DistCp job was started and the time it actually tried to copy the file.

It would make sense to get the *current* size of the source file when this 
condition is encountered, and to proceed if the newly reported file size 
matches that of the file copied.
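
The proposed check could look roughly like the sketch below. This is not 
DistCp's actual code: the class name CopySizeCheck, the method sizeMatches, 
and the LongSupplier used to look up the current source size are all 
hypothetical names chosen for illustration. In DistCp the supplier would 
presumably wrap a fresh call such as FileSystem.getFileStatus(path).getLen() 
on the source file system.

```java
import java.util.function.LongSupplier;

// Hypothetical helper illustrating the proposal; not part of DistCp.
final class CopySizeCheck {

    /**
     * Returns true if the copy should be accepted.
     *
     * @param copiedBytes       bytes actually written to the destination
     * @param expectedBytes     source size recorded at job initialization
     * @param currentSourceSize looks up the source size again (assumed to
     *                          query the source file system when invoked)
     */
    static boolean sizeMatches(long copiedBytes,
                               long expectedBytes,
                               LongSupplier currentSourceSize) {
        // Fast path: the size recorded at initialization still holds.
        if (copiedBytes == expectedBytes) {
            return true;
        }
        // The expectation may be stale, so re-read the source size once
        // and accept the copy if it matches what was actually copied.
        return copiedBytes == currentSourceSize.getAsLong();
    }
}
```

With the sizes from the log message above, sizeMatches(3451980786L, 
3422552064L, () -> 3451980786L) accepts the copy, because the source now 
reports the same size that was copied. If the source still reports 
3422552064 bytes, the mismatch is genuine and the existing retry path 
applies. The supplier is only invoked on a mismatch, so the extra lookup 
costs nothing in the common case.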


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
