[
https://issues.apache.org/jira/browse/HADOOP-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tsz Wo (Nicholas), SZE updated HADOOP-2725:
-------------------------------------------
Attachment: 2725_20080206.patch
2725_20080206.patch: checks for duplicate sources, then performs an atomic
copy (i.e. copies to a temporary file and renames it).
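
For readers unfamiliar with the copy-to-tmp-and-rename idea, here is a minimal
sketch on a local filesystem. It is not the code in the patch: the function
name atomic_copy and the ".tmp-" prefix are hypothetical, and the local
filesystem stands in for HDFS. The point is that the destination name only
appears after the whole file has been written, so a failed copy cannot leave a
truncated file under the final name.

```python
import os
import shutil
import tempfile


def atomic_copy(src, dst):
    """Copy src to dst so that dst is either untouched or complete."""
    dst_dir = os.path.dirname(os.path.abspath(dst))
    # Create the temporary file next to the destination so the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dst_dir, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp, open(src, "rb") as source:
            shutil.copyfileobj(source, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Publish the file only if the full length arrived.
        if os.path.getsize(tmp_path) != os.path.getsize(src):
            raise IOError("short copy: %s" % src)
        os.replace(tmp_path, dst)
    except BaseException:
        # Clean up the partial temporary file and re-raise.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the copy fails part-way, only the temporary file is affected and it is
removed before the error propagates.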
> Distcp truncates some files when copying
> ----------------------------------------
>
> Key: HADOOP-2725
> URL: https://issues.apache.org/jira/browse/HADOOP-2725
> Project: Hadoop Core
> Issue Type: Bug
> Components: dfs, util
> Affects Versions: 0.16.0
> Environment: Nightly build:
> http://hadoopqa.yst.corp.yahoo.com:8080/hudson/job/Hadoop-LinuxTest/770/
> With patches for HADOOP-2095 and HADOOP-2119.
> Reporter: Murtaza A. Basrai
> Assignee: Tsz Wo (Nicholas), SZE
> Priority: Critical
> Fix For: 0.16.1
>
> Attachments: 2725_20080206.patch
>
>
> We used distcp to copy ~100 TB of data between two clusters of ~1400 nodes each.
> Command used (it was run on the src cluster):
> hadoop distcp -log /logdir/logfile hdfs://src-namenode:8600//src-dir-1
> hdfs://src-namenode:8600//src-dir-2 ... hdfs://src-namenode:8600//src-dir-n
> hdfs://tgt-namenode:8600//dst-dir
> Distcp completed without errors, but when we checked the file sizes on the
> src and tgt clusters, we noticed differences in file sizes for 9 files (~6
> GB).
> src-file-1 666762714 bytes -> tgt-file-1 134217728 bytes
> src-file-2 673791814 bytes -> tgt-file-2 536870912 bytes
> src-file-3 692172075 bytes -> tgt-file-3 0 bytes
> All target files are truncated at block boundaries (some have 0 size).
> I looked at the log files, and noticed a few things:
> 1. There are 31059 log files (same as the number of Maps the job had).
> 2. 246 of the log files are non-empty.
> 3. All non-empty log files are of the form:
> SKIP: hdfs://src-namenode/src-dir-a/src-file-x
> SKIP: hdfs://src-namenode/src-dir-b/src-file-y
> SKIP: hdfs://src-namenode/src-dir-c/src-file-z
> 4. All 9 truncated files were listed in the log files as skipped files.
> 5. All 9 files were the last entry in their respective log files.
> e.g.
> Non-empty logfile 1:
> SKIP: hdfs://src-namenode/src-dir-a/src-file-x
> SKIP: hdfs://src-namenode/src-dir-b/src-file-y
> SKIP: hdfs://src-namenode/src-dir-c/src-file-z <-- Truncated file
> Non-empty logfile 2:
> SKIP: hdfs://src-namenode/src-dir-p/src-file-m
> SKIP: hdfs://src-namenode/src-dir-q/src-file-n <-- Truncated file
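
The size comparison described in the report could be automated along the
following lines. This is only a sketch: the function name find_truncated and
the keys file-1 to file-3 are hypothetical, and the 128 MiB block size is an
assumption (the report does not state it; it is chosen because the three
quoted target sizes are all multiples of it). The sizes themselves are the
ones quoted in the report.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed; the report does not state the block size


def find_truncated(sizes, block_size=BLOCK_SIZE):
    """Return files whose target is shorter than the source.

    sizes maps a file name to a (source_bytes, target_bytes) pair.
    Each result is (name, source_bytes, target_bytes, on_block_boundary).
    """
    truncated = []
    for name, (src_bytes, tgt_bytes) in sorted(sizes.items()):
        if tgt_bytes < src_bytes:
            on_boundary = tgt_bytes % block_size == 0
            truncated.append((name, src_bytes, tgt_bytes, on_boundary))
    return truncated


# Sizes quoted in the report; the keys are placeholder names.
reported = {
    "file-1": (666762714, 134217728),
    "file-2": (673791814, 536870912),
    "file-3": (692172075, 0),
}

for row in find_truncated(reported):
    print(row)
```

Under the assumed block size, every reported file is flagged as truncated
exactly on a block boundary, which matches the observation in the report.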