[jira] [Commented] (HDFS-14869) Data loss in case of distcp using snapshot diff. Replication should include rename records if file was skipped in the previous iteration
[ https://issues.apache.org/jira/browse/HDFS-14869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16989699#comment-16989699 ] Hudson commented on HDFS-14869: --- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #17732 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/17732/]) HDFS-14869 Copy renamed files which are not excluded anymore by filter (shashikant: rev fc97034b29243a0509633849de55aa734859) * (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpSync.java * (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCp.java * (edit) hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestDistCpSync.java > Data loss in case of distcp using snapshot diff. Replication should include > rename records if file was skipped in the previous iteration > > > Key: HDFS-14869 > URL: https://issues.apache.org/jira/browse/HDFS-14869 > Project: Hadoop HDFS > Issue Type: Bug > Components: distcp >Reporter: Aasha Medhi >Assignee: Aasha Medhi >Priority: Major > Fix For: 3.1.4 > > > This issue arises when a directory or file is excluded by exclusion filter > during distcp replication. Later on if the directory is renamed later to a > name which is not excluded by the filter, the snapshot diff reports only a > rename operation. The directory is never copied to target even though its > not excluded now. This also doesn't throw any error so there is no way to > find the issue. > Steps to reproduce > * Create a directory in hdfs to copy using distcp. > * Include a staging folder in the directory. > {code:java} > [hdfs@ctr-e141-1563959304486-33995-01-03 hadoop-mapreduce]$ hadoop fs -ls > /tmp/tocopy > Found 4 items > -rw-r--r-- 3 hdfs hdfs 16 2019-09-12 10:32 /tmp/tocopy/.b.txt > drwxr-xr-x - hdfs hdfs 0 2019-09-23 09:18 /tmp/tocopy/.staging > -rw-r--r-- 3 hdfs hdfs 12 2019-09-12 10:32 /tmp/tocopy/a.txt > -rw-r--r-- 3 hdfs hdfs 4 2019-09-20 08:23 /tmp/tocopy/foo.txt{code} > * The exclusion filter is set to exclude any staging directory > {code:java} > [hdfs@ctr-e141-1563959304486-33995-01-03 hadoop-mapreduce]$ cat > /tmp/filter > .*\.Trash.* > .*\.staging.*{code} > * Do a copy using distcp snapshots, the staging directory is not replicated. > {code:java} > hadoop jar hadoop-distcp-3.3.0-SNAPSHOT.jar > -Dmapreduce.job.user.classpath.first=true -filters /tmp/filter > /tmp/tocopy/.snapshot/s1 /tmp/target > [hdfs@ctr-e141-1563959304486-33995-01-03 root]$ hadoop fs -ls /tmp/target > Found 3 items > -rw-r--r-- 3 hdfs hdfs 16 2019-09-24 06:56 /tmp/target/.b.txt > -rw-r--r-- 3 hdfs hdfs 12 2019-09-24 06:56 /tmp/target/a.txt > -rw-r--r-- 3 hdfs hdfs 4 2019-09-24 06:56 /tmp/target/foo.txt{code} > * Rename the staging directory to final > {code:java} > [hdfs@ctr-e141-1563959304486-33995-01-03 hadoop-mapreduce]$ hadoop fs -mv > /tmp/tocopy/.staging /tmp/tocopy/final{code} > * Do a copy using snapshot diff. > {code:java} > [hdfs@ctr-e141-1563959304486-33995-01-03 hadoop-mapreduce]$ hdfs > snapshotDiff /tmp/tocopy s1 s2[hdfs@ctr-e141-1563959304486-33995-01-03 > hadoop-mapreduce]$ hdfs snapshotDiff /tmp/tocopy s1 s2Difference between > snapshot s1 and snapshot s2 under directory /tmp/tocopy:M .R ./.staging -> > ./final > {code} > * The diff report just has a rename record and the new final directory is > never copied. > {code:java} > [hdfs@ctr-e141-1563959304486-33995-01-03 hadoop-mapreduce]$ hadoop jar > hadoop-distcp-3.3.0-SNAPSHOT.jar -Dmapreduce.job.user.classpath.first=true > -filters /tmp/filter -diff s1 s2 -update /tmp/tocopy /tmp/target > 19/09/24 07:05:32 INFO tools.DistCp: Input Options: > DistCpOptions{atomicCommit=false, syncFolder=true, deleteMissing=false, > ignoreFailures=false, overwrite=false, append=false, useDiff=true, > useRdiff=false, fromSnapshot=s1, toSnapshot=s2, skipCRC=false, blocking=true, > numListstatusThreads=0, maxMaps=20, mapBandwidth=0.0, > copyStrategy='uniformsize', preserveStatus=[BLOCKSIZE], atomicWorkPath=null, > logPath=null, sourceFileListing=null, sourcePaths=[/tmp/tocopy], > targetPath=/tmp/target, filtersFile='/tmp/filter', blocksPerChunk=0, > copyBufferSize=8192, verboseLog=false, directWrite=false}, > sourcePaths=[/tmp/tocopy], targetPathExists=true, preserveRawXattrsfalse > 19/09/24 07:05:32 INFO client.RMProxy: Connecting to ResourceManager at > ctr-e141-1563959304486-33995-01-03.hwx.site/172.27.68.128:8050 > 19/09/24 07:05:33 INFO client.AHSProxy: Connecting to Application History > server at ctr-e141-1563959304486-33995-01-03.hwx.site/172.27.68.128
[jira] [Commented] (HDFS-14869) Data loss in case of distcp using snapshot diff. Replication should include rename records if file was skipped in the previous iteration
[ https://issues.apache.org/jira/browse/HDFS-14869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948384#comment-16948384 ] Steve Loughran commented on HDFS-14869: --- as I said on the PR, I do not know enough about distCp to be safe to review this. I've had a quick glance at the diff and suggested changes to the test based on what I like to see in tests, but someone else is going to need to have to look at the core algorithm. Sorry > Data loss in case of distcp using snapshot diff. Replication should include > rename records if file was skipped in the previous iteration > > > Key: HDFS-14869 > URL: https://issues.apache.org/jira/browse/HDFS-14869 > Project: Hadoop HDFS > Issue Type: Bug > Components: distcp >Reporter: Aasha Medhi >Assignee: Aasha Medhi >Priority: Major > > This issue arises when a directory or file is excluded by exclusion filter > during distcp replication. Later on if the directory is renamed later to a > name which is not excluded by the filter, the snapshot diff reports only a > rename operation. The directory is never copied to target even though its > not excluded now. This also doesn't throw any error so there is no way to > find the issue. > Steps to reproduce > * Create a directory in hdfs to copy using distcp. > * Include a staging folder in the directory. > {code:java} > [hdfs@ctr-e141-1563959304486-33995-01-03 hadoop-mapreduce]$ hadoop fs -ls > /tmp/tocopy > Found 4 items > -rw-r--r-- 3 hdfs hdfs 16 2019-09-12 10:32 /tmp/tocopy/.b.txt > drwxr-xr-x - hdfs hdfs 0 2019-09-23 09:18 /tmp/tocopy/.staging > -rw-r--r-- 3 hdfs hdfs 12 2019-09-12 10:32 /tmp/tocopy/a.txt > -rw-r--r-- 3 hdfs hdfs 4 2019-09-20 08:23 /tmp/tocopy/foo.txt{code} > * The exclusion filter is set to exclude any staging directory > {code:java} > [hdfs@ctr-e141-1563959304486-33995-01-03 hadoop-mapreduce]$ cat > /tmp/filter > .*\.Trash.* > .*\.staging.*{code} > * Do a copy using distcp snapshots, the staging directory is not replicated. > {code:java} > hadoop jar hadoop-distcp-3.3.0-SNAPSHOT.jar > -Dmapreduce.job.user.classpath.first=true -filters /tmp/filter > /tmp/tocopy/.snapshot/s1 /tmp/target > [hdfs@ctr-e141-1563959304486-33995-01-03 root]$ hadoop fs -ls /tmp/target > Found 3 items > -rw-r--r-- 3 hdfs hdfs 16 2019-09-24 06:56 /tmp/target/.b.txt > -rw-r--r-- 3 hdfs hdfs 12 2019-09-24 06:56 /tmp/target/a.txt > -rw-r--r-- 3 hdfs hdfs 4 2019-09-24 06:56 /tmp/target/foo.txt{code} > * Rename the staging directory to final > {code:java} > [hdfs@ctr-e141-1563959304486-33995-01-03 hadoop-mapreduce]$ hadoop fs -mv > /tmp/tocopy/.staging /tmp/tocopy/final{code} > * Do a copy using snapshot diff. > {code:java} > [hdfs@ctr-e141-1563959304486-33995-01-03 hadoop-mapreduce]$ hdfs > snapshotDiff /tmp/tocopy s1 s2[hdfs@ctr-e141-1563959304486-33995-01-03 > hadoop-mapreduce]$ hdfs snapshotDiff /tmp/tocopy s1 s2Difference between > snapshot s1 and snapshot s2 under directory /tmp/tocopy:M .R ./.staging -> > ./final > {code} > * The diff report just has a rename record and the new final directory is > never copied. > {code:java} > [hdfs@ctr-e141-1563959304486-33995-01-03 hadoop-mapreduce]$ hadoop jar > hadoop-distcp-3.3.0-SNAPSHOT.jar -Dmapreduce.job.user.classpath.first=true > -filters /tmp/filter -diff s1 s2 -update /tmp/tocopy /tmp/target > 19/09/24 07:05:32 INFO tools.DistCp: Input Options: > DistCpOptions{atomicCommit=false, syncFolder=true, deleteMissing=false, > ignoreFailures=false, overwrite=false, append=false, useDiff=true, > useRdiff=false, fromSnapshot=s1, toSnapshot=s2, skipCRC=false, blocking=true, > numListstatusThreads=0, maxMaps=20, mapBandwidth=0.0, > copyStrategy='uniformsize', preserveStatus=[BLOCKSIZE], atomicWorkPath=null, > logPath=null, sourceFileListing=null, sourcePaths=[/tmp/tocopy], > targetPath=/tmp/target, filtersFile='/tmp/filter', blocksPerChunk=0, > copyBufferSize=8192, verboseLog=false, directWrite=false}, > sourcePaths=[/tmp/tocopy], targetPathExists=true, preserveRawXattrsfalse > 19/09/24 07:05:32 INFO client.RMProxy: Connecting to ResourceManager at > ctr-e141-1563959304486-33995-01-03.hwx.site/172.27.68.128:8050 > 19/09/24 07:05:33 INFO client.AHSProxy: Connecting to Application History > server at ctr-e141-1563959304486-33995-01-03.hwx.site/172.27.68.128:10200 > 19/09/24 07:05:33 INFO tools.DistCp: Number of paths in the copy list: 0 > 19/09/24 07:05:33 INFO client.RMProxy: Connecting to ResourceManager at > ctr-e141-1563959304486-33995-01-03.hwx.site/172.27.68.128:8050 > 19/09/24 07:05:33 INFO client.AHSProxy: Connecti
[jira] [Commented] (HDFS-14869) Data loss in case of distcp using snapshot diff. Replication should include rename records if file was skipped in the previous iteration
[ https://issues.apache.org/jira/browse/HDFS-14869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948318#comment-16948318 ] Shashikant Banerjee commented on HDFS-14869: [~ste...@apache.org], can you please have a look at this? > Data loss in case of distcp using snapshot diff. Replication should include > rename records if file was skipped in the previous iteration > > > Key: HDFS-14869 > URL: https://issues.apache.org/jira/browse/HDFS-14869 > Project: Hadoop HDFS > Issue Type: Bug > Components: distcp >Reporter: Aasha Medhi >Assignee: Aasha Medhi >Priority: Major > > This issue arises when a directory or file is excluded by exclusion filter > during distcp replication. Later on if the directory is renamed later to a > name which is not excluded by the filter, the snapshot diff reports only a > rename operation. The directory is never copied to target even though its > not excluded now. This also doesn't throw any error so there is no way to > find the issue. > Steps to reproduce > * Create a directory in hdfs to copy using distcp. > * Include a staging folder in the directory. > {code:java} > [hdfs@ctr-e141-1563959304486-33995-01-03 hadoop-mapreduce]$ hadoop fs -ls > /tmp/tocopy > Found 4 items > -rw-r--r-- 3 hdfs hdfs 16 2019-09-12 10:32 /tmp/tocopy/.b.txt > drwxr-xr-x - hdfs hdfs 0 2019-09-23 09:18 /tmp/tocopy/.staging > -rw-r--r-- 3 hdfs hdfs 12 2019-09-12 10:32 /tmp/tocopy/a.txt > -rw-r--r-- 3 hdfs hdfs 4 2019-09-20 08:23 /tmp/tocopy/foo.txt{code} > * The exclusion filter is set to exclude any staging directory > {code:java} > [hdfs@ctr-e141-1563959304486-33995-01-03 hadoop-mapreduce]$ cat > /tmp/filter > .*\.Trash.* > .*\.staging.*{code} > * Do a copy using distcp snapshots, the staging directory is not replicated. > {code:java} > hadoop jar hadoop-distcp-3.3.0-SNAPSHOT.jar > -Dmapreduce.job.user.classpath.first=true -filters /tmp/filter > /tmp/tocopy/.snapshot/s1 /tmp/target > [hdfs@ctr-e141-1563959304486-33995-01-03 root]$ hadoop fs -ls /tmp/target > Found 3 items > -rw-r--r-- 3 hdfs hdfs 16 2019-09-24 06:56 /tmp/target/.b.txt > -rw-r--r-- 3 hdfs hdfs 12 2019-09-24 06:56 /tmp/target/a.txt > -rw-r--r-- 3 hdfs hdfs 4 2019-09-24 06:56 /tmp/target/foo.txt{code} > * Rename the staging directory to final > {code:java} > [hdfs@ctr-e141-1563959304486-33995-01-03 hadoop-mapreduce]$ hadoop fs -mv > /tmp/tocopy/.staging /tmp/tocopy/final{code} > * Do a copy using snapshot diff. > {code:java} > [hdfs@ctr-e141-1563959304486-33995-01-03 hadoop-mapreduce]$ hdfs > snapshotDiff /tmp/tocopy s1 s2[hdfs@ctr-e141-1563959304486-33995-01-03 > hadoop-mapreduce]$ hdfs snapshotDiff /tmp/tocopy s1 s2Difference between > snapshot s1 and snapshot s2 under directory /tmp/tocopy:M .R ./.staging -> > ./final > {code} > * The diff report just has a rename record and the new final directory is > never copied. > {code:java} > [hdfs@ctr-e141-1563959304486-33995-01-03 hadoop-mapreduce]$ hadoop jar > hadoop-distcp-3.3.0-SNAPSHOT.jar -Dmapreduce.job.user.classpath.first=true > -filters /tmp/filter -diff s1 s2 -update /tmp/tocopy /tmp/target > 19/09/24 07:05:32 INFO tools.DistCp: Input Options: > DistCpOptions{atomicCommit=false, syncFolder=true, deleteMissing=false, > ignoreFailures=false, overwrite=false, append=false, useDiff=true, > useRdiff=false, fromSnapshot=s1, toSnapshot=s2, skipCRC=false, blocking=true, > numListstatusThreads=0, maxMaps=20, mapBandwidth=0.0, > copyStrategy='uniformsize', preserveStatus=[BLOCKSIZE], atomicWorkPath=null, > logPath=null, sourceFileListing=null, sourcePaths=[/tmp/tocopy], > targetPath=/tmp/target, filtersFile='/tmp/filter', blocksPerChunk=0, > copyBufferSize=8192, verboseLog=false, directWrite=false}, > sourcePaths=[/tmp/tocopy], targetPathExists=true, preserveRawXattrsfalse > 19/09/24 07:05:32 INFO client.RMProxy: Connecting to ResourceManager at > ctr-e141-1563959304486-33995-01-03.hwx.site/172.27.68.128:8050 > 19/09/24 07:05:33 INFO client.AHSProxy: Connecting to Application History > server at ctr-e141-1563959304486-33995-01-03.hwx.site/172.27.68.128:10200 > 19/09/24 07:05:33 INFO tools.DistCp: Number of paths in the copy list: 0 > 19/09/24 07:05:33 INFO client.RMProxy: Connecting to ResourceManager at > ctr-e141-1563959304486-33995-01-03.hwx.site/172.27.68.128:8050 > 19/09/24 07:05:33 INFO client.AHSProxy: Connecting to Application History > server at ctr-e141-1563959304486-33995-01-03.hwx.site/172.27.68.128:10200 > 19/09/24 07:05:33 INFO mapreduce.JobResourceUploader: Disabling Erasure > Coding for path: /user