DistCp should double-check copy size when expectation is unmet
--------------------------------------------------------------
Key: MAPREDUCE-2161
URL: https://issues.apache.org/jira/browse/MAPREDUCE-2161
Project: Hadoop Map/Reduce
Issue Type: Bug
Reporter: Dmitriy V. Ryaboy
As a basic sanity check, DistCp verifies that the file size on the destination
matches the file size at the source.
When this fails, DistCp logs something along the lines of
java.io.IOException: File size not matched: copied 3451980786 bytes (3.2g) to
tmpfile (=hdfs://dest.hdfs/dir/_distcp_tmp_7uxv32/2010/10/26/20/fille) but
expected 3422552064 bytes (3.2g) from hdfs://source.hdfs/dir/file)
and attempts to retry. The expected file size is picked up during
initialization, so it can be stale by the time the copy runs. This expectation
can be incorrect for at least two reasons: the file was still being written to
when DistCp was started (which is a bug in and of itself), or the file was
replaced at the source between the time the DistCp job was started and the time
it actually tried to copy the file.
It would make sense to get the *current* size of the source file when this
condition is encountered, and proceed if the newly reported file size matches
that of the file copied.
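
A minimal sketch of the proposed check follows. It is not DistCp's actual code:
the class name CopySizeCheck, the Result enum, and the verify method are
hypothetical names chosen for illustration. The current source size is passed
in as a LongSupplier so the sketch runs without a cluster; in a real job the
supplier would presumably wrap something like
FileSystem.getFileStatus(path).getLen() on the source file system.

```java
import java.util.function.LongSupplier;

// Hypothetical illustration of the proposed re-check; not DistCp's real code.
class CopySizeCheck {

    /** Outcome of comparing the copied size against the source size. */
    enum Result { MATCHED, MATCHED_AFTER_RECHECK, MISMATCH }

    /**
     * @param expectedBytes     source size recorded when the job was initialized
     * @param copiedBytes       number of bytes actually written to the destination
     * @param currentSourceSize looks up the source size *now*; only invoked
     *                          when the recorded expectation is not met
     */
    static Result verify(long expectedBytes, long copiedBytes,
                         LongSupplier currentSourceSize) {
        if (copiedBytes == expectedBytes) {
            return Result.MATCHED;
        }
        // The expectation may be stale: ask the source for its current size.
        long currentBytes = currentSourceSize.getAsLong();
        return currentBytes == copiedBytes
                ? Result.MATCHED_AFTER_RECHECK
                : Result.MISMATCH;
    }

    public static void main(String[] args) {
        // Sizes taken from the log message in this report; the supplier
        // stands in for a fresh size lookup on the source file system.
        long expected = 3422552064L;
        long copied = 3451980786L;
        System.out.println(verify(expected, copied, () -> 3451980786L));
    }
}
```

With this shape, only a MISMATCH result would still need to raise the
IOException and trigger a retry; the other two outcomes let the copy proceed.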