Aasha Medhi created HDFS-14869:
----------------------------------
Summary: Data loss in case of distcp using snapshot diff.
Key: HDFS-14869
URL: https://issues.apache.org/jira/browse/HDFS-14869
Project: Hadoop HDFS
Issue Type: Bug
Components: distcp
Reporter: Aasha Medhi
Steps to reproduce
* Create a directory in HDFS to copy using distcp.
* Include a staging folder in the directory.
{code:java}
[hdfs@ctr-e141-1563959304486-33995-01-000003 hadoop-mapreduce]$ hadoop fs -ls /tmp/tocopy
Found 4 items
-rw-r--r-- 3 hdfs hdfs 16 2019-09-12 10:32 /tmp/tocopy/.b.txt
drwxr-xr-x - hdfs hdfs 0 2019-09-23 09:18 /tmp/tocopy/.staging
-rw-r--r-- 3 hdfs hdfs 12 2019-09-12 10:32 /tmp/tocopy/a.txt
-rw-r--r-- 3 hdfs hdfs 4 2019-09-20 08:23 /tmp/tocopy/foo.txt{code}
* Set the exclusion filter to exclude any staging directory.
{code:java}
[hdfs@ctr-e141-1563959304486-33995-01-000003 hadoop-mapreduce]$ cat /tmp/filter
.*\.Trash.*
.*\.staging.*{code}
* Copy snapshot s1 using distcp. The staging directory is not replicated.
{code:java}
hadoop jar hadoop-distcp-3.3.0-SNAPSHOT.jar -Dmapreduce.job.user.classpath.first=true -filters /tmp/filter /tmp/tocopy/.snapshot/s1 /tmp/target
[hdfs@ctr-e141-1563959304486-33995-01-000003 root]$ hadoop fs -ls /tmp/target
Found 3 items
-rw-r--r-- 3 hdfs hdfs 16 2019-09-24 06:56 /tmp/target/.b.txt
-rw-r--r-- 3 hdfs hdfs 12 2019-09-24 06:56 /tmp/target/a.txt
-rw-r--r-- 3 hdfs hdfs 4 2019-09-24 06:56 /tmp/target/foo.txt{code}
* Rename the staging directory to final.
{code:java}
[hdfs@ctr-e141-1563959304486-33995-01-000003 hadoop-mapreduce]$ hadoop fs -mv /tmp/tocopy/.staging /tmp/tocopy/final{code}
* Do a copy using snapshot diff. The diff between snapshots s1 and s2 is:
{code:java}
[hdfs@ctr-e141-1563959304486-33995-01-000003 hadoop-mapreduce]$ hdfs snapshotDiff /tmp/tocopy s1 s2
Difference between snapshot s1 and snapshot s2 under directory /tmp/tocopy:
M .
R ./.staging -> ./final
{code}
* The diff report contains only a rename record, so the new final directory is never copied. The log reports 0 paths in the copy list.
{code:java}
[hdfs@ctr-e141-1563959304486-33995-01-000003 hadoop-mapreduce]$ hadoop jar hadoop-distcp-3.3.0-SNAPSHOT.jar -Dmapreduce.job.user.classpath.first=true -filters /tmp/filter -diff s1 s2 -update /tmp/tocopy /tmp/target
19/09/24 07:05:32 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=true, deleteMissing=false, ignoreFailures=false, overwrite=false, append=false, useDiff=true, useRdiff=false, fromSnapshot=s1, toSnapshot=s2, skipCRC=false, blocking=true, numListstatusThreads=0, maxMaps=20, mapBandwidth=0.0, copyStrategy='uniformsize', preserveStatus=[BLOCKSIZE], atomicWorkPath=null, logPath=null, sourceFileListing=null, sourcePaths=[/tmp/tocopy], targetPath=/tmp/target, filtersFile='/tmp/filter', blocksPerChunk=0, copyBufferSize=8192, verboseLog=false, directWrite=false}, sourcePaths=[/tmp/tocopy], targetPathExists=true, preserveRawXattrsfalse
19/09/24 07:05:32 INFO client.RMProxy: Connecting to ResourceManager at ctr-e141-1563959304486-33995-01-000003.hwx.site/172.27.68.128:8050
19/09/24 07:05:33 INFO client.AHSProxy: Connecting to Application History server at ctr-e141-1563959304486-33995-01-000003.hwx.site/172.27.68.128:10200
19/09/24 07:05:33 INFO tools.DistCp: Number of paths in the copy list: 0
19/09/24 07:05:33 INFO client.RMProxy: Connecting to ResourceManager at ctr-e141-1563959304486-33995-01-000003.hwx.site/172.27.68.128:8050
19/09/24 07:05:33 INFO client.AHSProxy: Connecting to Application History server at ctr-e141-1563959304486-33995-01-000003.hwx.site/172.27.68.128:10200
19/09/24 07:05:33 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/hdfs/.staging/job_1568647978682_0010
19/09/24 07:05:34 INFO mapreduce.JobSubmitter: number of splits:0
19/09/24 07:05:34 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1568647978682_0010
19/09/24 07:05:34 INFO mapreduce.JobSubmitter: Executing with tokens: []
19/09/24 07:05:34 INFO conf.Configuration: found resource resource-types.xml at file:/etc/hadoop/3.1.4.0-272/0/resource-types.xml
19/09/24 07:05:34 INFO impl.YarnClientImpl: Submitted application application_1568647978682_0010
19/09/24 07:05:34 INFO mapreduce.Job: The url to track the job: http://ctr-e141-1563959304486-33995-01-000003.hwx.site:8088/proxy/application_1568647978682_0010/
19/09/24 07:05:34 INFO tools.DistCp: DistCp job-id: job_1568647978682_0010
19/09/24 07:05:34 INFO mapreduce.Job: Running job: job_1568647978682_0010
19/09/24 07:05:40 INFO mapreduce.Job: Job job_1568647978682_0010 running in uber mode : false
19/09/24 07:05:40 INFO mapreduce.Job: map 0% reduce 0%
19/09/24 07:09:43 INFO mapreduce.Job: Job job_1568647978682_0010 completed successfully
19/09/24 07:09:43 INFO mapreduce.Job: Counters: 2
  Job Counters
    Total time spent by all maps in occupied slots (ms)=0
    Total time spent by all reduces in occupied slots (ms)=0
[hdfs@ctr-e141-1563959304486-33995-01-000003 root]$ hadoop fs -ls /tmp/target
Found 3 items
-rw-r--r-- 3 hdfs hdfs 16 2019-09-24 06:56 /tmp/target/.b.txt
-rw-r--r-- 3 hdfs hdfs 12 2019-09-24 06:56 /tmp/target/a.txt
-rw-r--r-- 3 hdfs hdfs 4 2019-09-24 06:56 /tmp/target/foo.txt
{code}
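The interaction between the filter and the rename record can be modelled in a few lines. The sketch below is a hypothetical illustration, not DistCp's code. It assumes each line of /tmp/filter is a regular expression that excludes a path when it matches the whole path, with Python's `re.fullmatch` standing in for the Java regex matching. The names `excluded` and `classify_rename` are invented for this example. It shows one way a filter-aware sync might reinterpret a rename whose source was filtered out of the earlier copy: since the target never received `.staging`, replaying the rename there has nothing to act on, and the renamed directory would have to be treated as new data.

```python
import re

# The two patterns from /tmp/filter in the steps above.
FILTERS = [re.compile(p) for p in (r".*\.Trash.*", r".*\.staging.*")]


def excluded(path: str) -> bool:
    """True if any filter pattern matches the entire path (assumed semantics)."""
    return any(f.fullmatch(path) for f in FILTERS)


def classify_rename(src: str, dst: str) -> str:
    """Hypothetical handling of a snapshot-diff rename record src -> dst."""
    src_out, dst_out = excluded(src), excluded(dst)
    if src_out and not dst_out:
        # The target never had src, so dst must be copied in full.
        return "copy"
    if not src_out and dst_out:
        # dst is filtered, so src should simply disappear from the target.
        return "delete"
    if src_out and dst_out:
        # Neither side is replicated; nothing to do.
        return "skip"
    # Both sides are replicated; the rename can be replayed on the target.
    return "rename"


# The rename record from the diff report above.
print(classify_rename("/tmp/tocopy/.staging", "/tmp/tocopy/final"))  # copy
```

Under these assumptions the `.staging -> final` record is classified as a copy, which is the behaviour the steps above expect but do not observe.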