Distcp truncates some files when copying
----------------------------------------

                 Key: HADOOP-2725
                 URL: https://issues.apache.org/jira/browse/HADOOP-2725
             Project: Hadoop Core
          Issue Type: Bug
          Components: util
    Affects Versions: 0.16.0
         Environment: Nightly build: 
http://hadoopqa.yst.corp.yahoo.com:8080/hudson/job/Hadoop-LinuxTest/770/
With patches for HADOOP-2095 and HADOOP-2119.
            Reporter: Murtaza A. Basrai
            Priority: Critical
             Fix For: 0.16.0


We used distcp to copy ~100 TB of data between two clusters of ~1400 nodes each.

Command used (it was run on the src cluster):
hadoop distcp -log /logdir/logfile hdfs://src-namenode:8600//src-dir-1 
hdfs://src-namenode:8600//src-dir-2 ... hdfs://src-namenode:8600//src-dir-n 
hdfs://tgt-namenode:8600//dst-dir

Distcp completed without errors, but when we compared file sizes on the src 
and tgt clusters, 9 files (~6 GB) differed in size. For example:

src-file-1 666762714 bytes -> tgt-file-1 134217728 bytes
src-file-2 673791814 bytes -> tgt-file-2 536870912 bytes
src-file-3 692172075 bytes -> tgt-file-3 0 bytes
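
A size comparison like the one above could be scripted roughly as follows. This is only a minimal sketch, not the procedure actually used for this report: it assumes the per-file sizes from each cluster have already been collected into dictionaries keyed by a common relative path. The names find_size_mismatches, src_sizes and tgt_sizes are hypothetical, and "ok-file" is a made-up entry standing in for a file that copied correctly.

```python
def find_size_mismatches(src_sizes, tgt_sizes):
    """Return {path: (src_bytes, tgt_bytes)} for files whose sizes differ.

    A file that is missing on the target is reported with tgt_bytes = None.
    """
    mismatches = {}
    for path, src_bytes in src_sizes.items():
        tgt_bytes = tgt_sizes.get(path)
        if tgt_bytes != src_bytes:
            mismatches[path] = (src_bytes, tgt_bytes)
    return mismatches


# Sizes of the three example files from this report, plus one made-up
# entry ("ok-file") for a file whose copy is complete.
src_sizes = {
    "file-1": 666762714,
    "file-2": 673791814,
    "file-3": 692172075,
    "ok-file": 1024,
}
tgt_sizes = {
    "file-1": 134217728,
    "file-2": 536870912,
    "file-3": 0,
    "ok-file": 1024,
}

mismatches = find_size_mismatches(src_sizes, tgt_sizes)
for path, (src_bytes, tgt_bytes) in sorted(mismatches.items()):
    print(f"{path}: {src_bytes} bytes -> {tgt_bytes} bytes")
```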

All truncated target files end at a block boundary (some are 0 bytes).
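
The block-boundary observation can be checked with simple arithmetic. The sketch below assumes a block size of 134217728 bytes (128 MB). The report does not state the block size, but both non-zero truncated sizes listed above are whole multiples of that value. The helper name blocks_if_aligned is hypothetical.

```python
BLOCK_SIZE = 134217728  # assumed block size (128 MB); not stated in the report


def blocks_if_aligned(size_bytes, block_size=BLOCK_SIZE):
    """Return the number of whole blocks if size_bytes is a multiple of
    block_size, otherwise None."""
    if size_bytes % block_size == 0:
        return size_bytes // block_size
    return None


# Target sizes from the report: each is a whole number of blocks.
for name, size in [
    ("tgt-file-1", 134217728),
    ("tgt-file-2", 536870912),
    ("tgt-file-3", 0),
]:
    print(name, size, blocks_if_aligned(size))

# Source sizes from the report: none is block-aligned.
for name, size in [
    ("src-file-1", 666762714),
    ("src-file-2", 673791814),
    ("src-file-3", 692172075),
]:
    print(name, size, blocks_if_aligned(size))
```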


I looked at the log files, and noticed a few things:

1. There are 31059 log files (same as the number of Maps the job had).

2. 246 of the log files are non-empty.

3. All non-empty log files are of the form:

SKIP: hdfs://src-namenode/src-dir-a/src-file-x
SKIP: hdfs://src-namenode/src-dir-b/src-file-y
SKIP: hdfs://src-namenode/src-dir-c/src-file-z

4. All 9 truncated files are listed in the log files as skipped files.

5. All 9 files were the last entry in their respective log files.

e.g.
Non-empty logfile 1:

SKIP: hdfs://src-namenode/src-dir-a/src-file-x
SKIP: hdfs://src-namenode/src-dir-b/src-file-y
SKIP: hdfs://src-namenode/src-dir-c/src-file-z  <-- Truncated file

Non-empty logfile 2:

SKIP: hdfs://src-namenode/src-dir-p/src-file-m
SKIP: hdfs://src-namenode/src-dir-q/src-file-n  <-- Truncated file
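
The log analysis above (collect the SKIP entries of each log file and look at the last one) could be reproduced along these lines. This is a sketch under stated assumptions: the log contents are assumed to be already available as lists of text lines keyed by a log file name, and the only line format handled is the "SKIP: <path>" form shown in this report. The names skipped_paths, last_skipped and the logfile-N keys are hypothetical.

```python
SKIP_PREFIX = "SKIP: "


def skipped_paths(log_lines):
    """Return the paths from 'SKIP: <path>' lines, in order of appearance."""
    paths = []
    for line in log_lines:
        line = line.strip()
        if line.startswith(SKIP_PREFIX):
            paths.append(line[len(SKIP_PREFIX):])
    return paths


def last_skipped(logs):
    """Map each log that has at least one SKIP entry to its final entry."""
    result = {}
    for name, lines in logs.items():
        paths = skipped_paths(lines)
        if paths:
            result[name] = paths[-1]
    return result


# Log contents mirroring the two examples in this report, plus one empty log.
logs = {
    "logfile-1": [
        "SKIP: hdfs://src-namenode/src-dir-a/src-file-x",
        "SKIP: hdfs://src-namenode/src-dir-b/src-file-y",
        "SKIP: hdfs://src-namenode/src-dir-c/src-file-z",
    ],
    "logfile-2": [
        "SKIP: hdfs://src-namenode/src-dir-p/src-file-m",
        "SKIP: hdfs://src-namenode/src-dir-q/src-file-n",
    ],
    "logfile-3": [],
}

for name, path in sorted(last_skipped(logs).items()):
    print(f"{name}: last skipped entry is {path}")
```

The last entries found this way can then be compared against the list of truncated files.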

