[ 
https://issues.apache.org/jira/browse/HADOOP-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12563366#action_12563366
 ] 

Chris Douglas commented on HADOOP-2725:
---------------------------------------

bq. Maybe distcp should copy a file to a temporary filename in the 
destination folder and then, when the entire copy is successful, rename it 
to the real filename.

This is a good idea. However, since distcp accepts multiple sources, several 
sources can map to the same destination. In the default case, skipping files 
already present at the destination prevents both accidental deletion of data 
there and, now that files appear as soon as they are created, one map task 
overwriting a file that another map is copying or has already copied. If one 
doesn't expect any files to be skipped, one has to search the logs for 
skipped files.
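
As a rough illustration of that log search, the sketch below gathers the 
skipped sources per log file. It assumes the log files have already been 
copied to a local directory and that each entry has the "SKIP: <uri>" form 
shown in the issue description; the function name collect_skipped is 
hypothetical and not part of distcp.

```python
import os

SKIP_PREFIX = "SKIP: "  # entry format taken from the quoted log excerpts


def collect_skipped(log_dir):
    """Map each non-empty log file name to the source URIs it lists as skipped.

    Assumes log_dir is a local directory holding copies of the distcp logs.
    """
    skipped = {}
    for name in sorted(os.listdir(log_dir)):
        path = os.path.join(log_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path, encoding="utf-8") as fh:
            uris = [
                line[len(SKIP_PREFIX):].strip()
                for line in fh
                if line.startswith(SKIP_PREFIX)
            ]
        if uris:  # empty logs mean the map skipped nothing
            skipped[name] = uris
    return skipped
```

Any URI returned here is a candidate for a size comparison between source 
and destination, since a skipped file may be a partial copy.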

Copying to a temporary dir and renaming can distinguish part of the latter 
case, since collisions at creation time are unambiguously part of the copy. The 
problem changes, however, because now we must distinguish between files being 
copied by another map and files left over from a failed attempt (in the temp 
dir). The distcp user still needs to review the log to determine which, if 
any, of the cruft left in the temp dir is relevant.
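
A minimal sketch of the copy-then-rename idea, written against a local file 
system rather than HDFS, may make the ambiguity concrete. The function name 
copy_via_temp, the temp-dir layout, and the three return values are all 
assumptions made for illustration; this is not how distcp is implemented.

```python
import os
import shutil


def copy_via_temp(src, dst, tmp_dir):
    """Copy src to dst through a temp file; report what happened.

    Returns "copied", "skipped" (dst already present), or "collision"
    (a temp file for dst already exists). Local-filesystem analogue only.
    """
    if os.path.exists(dst):
        return "skipped"  # mirrors the default of skipping present files

    tmp = os.path.join(tmp_dir, os.path.basename(dst))
    try:
        fout = open(tmp, "xb")  # exclusive create: fails if tmp exists
    except FileExistsError:
        # Either another task is copying to this destination right now, or
        # an earlier attempt failed and left the file behind. The temp file
        # alone does not say which.
        return "collision"

    # If the copy fails midway, the partial temp file is deliberately left
    # in place, as it would be after a task crash.
    with fout, open(src, "rb") as fin:
        shutil.copyfileobj(fin, fout)

    # Publish only after the whole copy succeeded. Note the remaining race
    # between the existence check above and this call.
    os.replace(tmp, dst)
    return "copied"
```

The destination never holds a partial file in this scheme, but a "collision" 
result still needs a human (or a log) to decide whether the temp file is 
live or stale.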

Running the task a second time with '-update' seems easier, if less efficient.

> Distcp truncates some files when copying
> ----------------------------------------
>
>                 Key: HADOOP-2725
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2725
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: util
>    Affects Versions: 0.16.0
>         Environment: Nightly build: 
> http://hadoopqa.yst.corp.yahoo.com:8080/hudson/job/Hadoop-LinuxTest/770/
> With patches for HADOOP-2095 and HADOOP-2119.
>            Reporter: Murtaza A. Basrai
>            Priority: Critical
>             Fix For: 0.16.0
>
>
> We used distcp to copy ~100 TB of data between two clusters of ~1400 nodes each.
> Command used (it was run on the src cluster):
> hadoop distcp -log /logdir/logfile hdfs://src-namenode:8600//src-dir-1 
> hdfs://src-namenode:8600//src-dir-2 ... hdfs://src-namenode:8600//src-dir-n 
> hdfs://tgt-namenode:8600//dst-dir
> Distcp completed without errors, but when we checked the file sizes on the 
> src and tgt clusters, we noticed differences in file sizes for 9 files (~6 
> GB).
> src-file-1 666762714 bytes -> tgt-file-1 134217728 bytes
> src-file-2 673791814 bytes -> tgt-file-2 536870912 bytes
> src-file-3 692172075 bytes -> tgt-file-3 0 bytes
> All target files are truncated at block boundaries (some have 0 size).
> I looked at the log files, and noticed a few things:
> 1. There are 31059 log files (same as the number of Maps the job had).
> 2. 246 of the log files are non-empty.
> 3. All non-empty log files are of the form:
> SKIP: hdfs://src-namenode/src-dir-a/src-file-x
> SKIP: hdfs://src-namenode/src-dir-b/src-file-y
> SKIP: hdfs://src-namenode/src-dir-c/src-file-z
> 4. All 9 files which were truncated were included in the log files as skipped 
> files.
> 5. All 9 files were the last entry in their respective log files.
> e.g.
> Non-empty logfile 1:
> SKIP: hdfs://src-namenode/src-dir-a/src-file-x
> SKIP: hdfs://src-namenode/src-dir-b/src-file-y
> SKIP: hdfs://src-namenode/src-dir-c/src-file-z  <-- Truncated file
> Non-empty logfile 2:
> SKIP: hdfs://src-namenode/src-dir-p/src-file-m
> SKIP: hdfs://src-namenode/src-dir-q/src-file-n  <-- Truncated file
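
The block-boundary observation in the quoted report can be checked with 
simple arithmetic. The sketch below uses the three size pairs given in the 
report; the 128 MiB block size is an assumption, since the report does not 
state the configured value (the target sizes are also whole multiples of 
64 MiB).

```python
MIB = 1024 * 1024
ASSUMED_BLOCK_SIZE = 128 * MIB  # assumption; not stated in the report

# (source bytes, target bytes) pairs quoted in the issue description
REPORTED = [
    (666762714, 134217728),
    (673791814, 536870912),
    (692172075, 0),
]


def blocks_if_aligned(size, block_size=ASSUMED_BLOCK_SIZE):
    """Return the whole number of blocks in size, or None if size is not
    an exact multiple of block_size."""
    quotient, remainder = divmod(size, block_size)
    return quotient if remainder == 0 else None


if __name__ == "__main__":
    for src_size, tgt_size in REPORTED:
        print(src_size, blocks_if_aligned(src_size),
              tgt_size, blocks_if_aligned(tgt_size))
```

Under that assumption the targets come out as 1, 4, and 0 whole blocks, 
while none of the source sizes is block-aligned, which is consistent with 
copies that stopped after a completed block.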

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
