[ 
https://issues.apache.org/jira/browse/HADOOP-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12591358#action_12591358
 ] 

Doug Cutting commented on HADOOP-3294:
--------------------------------------

Verifying lengths is cheap and would catch many problems.  It could be done by 
the reducer, and the output could list any discrepancies.  Checking CRCs is 
more expensive and should be optional if implemented.

> Verifying file sizes could have some implication when we support "appends".

That's true, so we shouldn't have a discrepancy fail the job, but it should 
still be logged so that the user can see which files were modified after they 
were copied.
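
For illustration, here is a minimal sketch of the length check described 
above. It is not distcp code: the class name LengthCheck, the method 
findDiscrepancies, and the use of plain path-to-length maps in place of real 
file system lookups are all assumptions made to keep the example 
self-contained. A discrepancy is reported, not thrown, so the job would not 
fail because of it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Hadoop: compares recorded source lengths
// with the lengths observed at the destination.
public class LengthCheck {

    /** Returns one report line per file whose destination length differs. */
    public static List<String> findDiscrepancies(Map<String, Long> sourceLengths,
                                                 Map<String, Long> destLengths) {
        List<String> report = new ArrayList<>();
        // Iterate in sorted path order so the report is stable.
        for (Map.Entry<String, Long> e : new TreeMap<>(sourceLengths).entrySet()) {
            String path = e.getKey();
            long expected = e.getValue();
            Long actual = destLengths.get(path);
            if (actual == null) {
                report.add(path + ": missing at destination (source length "
                        + expected + ")");
            } else if (actual != expected) {
                report.add(path + ": source length " + expected
                        + ", destination length " + actual);
            }
        }
        return report;
    }

    public static void main(String[] args) {
        // Made-up example data: one file arrived empty at the destination.
        Map<String, Long> src = new HashMap<>();
        src.put("dir/part-00000", 1024L);
        src.put("dir/part-00001", 2048L);
        Map<String, Long> dst = new HashMap<>();
        dst.put("dir/part-00000", 1024L);
        dst.put("dir/part-00001", 0L);

        // Log discrepancies instead of failing, since a length may change
        // legitimately once appends are supported.
        for (String line : findDiscrepancies(src, dst)) {
            System.out.println("WARN length mismatch: " + line);
        }
    }
}
```

In a real job the two maps would be filled from file status lookups on the 
source and destination file systems, and the report lines would go to the 
job output.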

> distcp leaves empty blocks after successful execution
> -----------------------------------------------------
>
>                 Key: HADOOP-3294
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3294
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: util
>    Affects Versions: 0.16.3
>         Environment: 0.16.3 without any patches. Dfs permissions turned off 
> everywhere, such that HADOOP-3138 and HADOOP-3186 do not apply
>            Reporter: Christian Kunz
>
> I copied around 40 TB between two hadoop clusters, with distcp running on 
> source.
> The job was *successful*, but one destination file was empty because its 
> only block was empty.
> None of the distcp log files mention this file.
> There were a couple of messages in the namenode server log of the destination 
> cluster referencing the file:
> hadoop-xxxnamenode-yyy.log.2008-04-19:2008-04-19 02:19:15,666 INFO 
> org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.allocateBlock: 
> destinationDir/_distcp_tmp_z0g93p/fileName. blk_-9209890281741927376
> hadoop-xxx-namenode-yyy.log.2008-04-19:2008-04-19 02:54:45,820 WARN 
> org.apache.hadoop.dfs.StateChange: DIR* NameSystem.internalReleaseCreate: 
> attempt to release a create lock on 
> destinationDir/_distcp_tmp_z0g93p/fileName file does not exist.
> distcp should not rely on the user to double-check.
> Would it make sense to add a reducer to compare destination file sizes with 
> source file sizes and take appropriate action?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
