[ 
https://issues.apache.org/jira/browse/HDFS-16145?focusedWorklogId=627930&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-627930
 ]

ASF GitHub Bot logged work on HDFS-16145:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 26/Jul/21 18:49
            Start Date: 26/Jul/21 18:49
    Worklog Time Spent: 10m 
      Work Description: bshashikant commented on a change in pull request #3234:
URL: https://github.com/apache/hadoop/pull/3234#discussion_r676858532



##########
File path: 
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpSync.java
##########
@@ -563,10 +589,27 @@ private Path translateRenamedPath(Path sourcePath,
     } else {
       List<DiffInfo> renameDiffsList =
           diffMap.get(SnapshotDiffReport.DiffType.RENAME);
+      List<DiffInfo> deletedDirDiffsList =
+          diffMap.get(SnapshotDiffReport.DiffType.DELETE);

Review comment:
       The list will hold all the entries that are marked deleted. When a single 
directory with 1000 files gets deleted, the snapshot diff report itself 
will have just 1 deleted entry for the directory.
   
   It already scans the whole rename list. With this change, it needs to scan the 
deleted list as well, and hence it can cause a performance problem depending on how 
many entries are marked deleted.
   
   I have now modified the logic to build a deleted list only for entries which have 
been marked deleted because of a rename to an excluded path (excluded by the 
filter). This should limit the scans.
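
   For illustration only, here is a minimal sketch of the narrower list described 
above: it records a source path only when a rename targets an excluded path. The 
`Entry` class, the string type tags and the `isExcluded` predicate are hypothetical 
stand-ins, not the actual `DiffInfo` or filter code in DistCpSync.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class RenameToExcludedSketch {

  /** Hypothetical stand-in for a snapshot diff entry; not DistCp's DiffInfo. */
  static final class Entry {
    final String type;    // "RENAME", "DELETE", ... (assumed tags)
    final String source;  // path in the older snapshot
    final String target;  // rename target, null for non-rename entries

    Entry(String type, String source, String target) {
      this.type = type;
      this.source = source;
      this.target = target;
    }
  }

  /**
   * Collects the sources of renames whose target is excluded by the filter.
   * Plain DELETE entries are left out, so the resulting list stays small
   * even when the diff report carries many deletions.
   */
  static List<String> deletedByExcludedRename(List<Entry> diffs,
      Predicate<String> isExcluded) {
    List<String> result = new ArrayList<>();
    for (Entry e : diffs) {
      if ("RENAME".equals(e.type) && e.target != null
          && isExcluded.test(e.target)) {
        result.add(e.source);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<Entry> diffs = Arrays.asList(
        new Entry("DELETE", "/data/old", null),
        new Entry("RENAME", "/data/dir1", "/data/.Trash/dir1"),
        new Entry("RENAME", "/data/a", "/data/b"));
    // The exclusion rule here (anything under .Trash) is an assumption.
    System.out.println(
        deletedByExcludedRename(diffs, p -> p.contains("/.Trash")));
  }
}
```

   Later scans then only walk this shorter list rather than every DELETE entry 
in the diff report.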




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 627930)
    Time Spent: 1.5h  (was: 1h 20m)

> CopyListing fails with FNF exception with snapshot diff
> -------------------------------------------------------
>
>                 Key: HDFS-16145
>                 URL: https://issues.apache.org/jira/browse/HDFS-16145
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: distcp
>            Reporter: Shashikant Banerjee
>            Assignee: Shashikant Banerjee
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Distcp with snapshotdiff and with filters marks a Rename as a delete 
> operation on the target if the rename target is a directory which is 
> excluded by the filter. But files/subdirs created/modified after the old 
> snapshot and prior to the Rename will still be present as 
> modified/created entries in the final copy list. Since the parent directory 
> is marked for deletion, these subsequent create/modify entries should be 
> ignored while building the final copy list. 
> In such cases, when the final copy list is built, distcp tries to do a 
> lookup for each created/modified file in the newer snapshot, which will fail 
> because the parent dir has already been moved to a new location in the later snapshot.
>  
> {code:java}
> sudo -u kms hadoop key create testkey
> hadoop fs -mkdir -p /data/gcgdlknnasg/
> hdfs crypto -createZone -keyName testkey -path /data/gcgdlknnasg/
> hadoop fs -mkdir -p /dest/gcgdlknnasg
> hdfs crypto -createZone -keyName testkey -path /dest/gcgdlknnasg
> hdfs dfs -mkdir /data/gcgdlknnasg/dir1
> hdfs dfsadmin -allowSnapshot /data/gcgdlknnasg/ 
> hdfs dfsadmin -allowSnapshot /dest/gcgdlknnasg/ 
> [root@nightly62x-1 logs]# hdfs dfs -ls -R /data/gcgdlknnasg/
> drwxrwxrwt   - hdfs supergroup          0 2021-07-16 14:05 
> /data/gcgdlknnasg/.Trash
> drwxr-xr-x   - hdfs supergroup          0 2021-07-16 13:07 
> /data/gcgdlknnasg/dir1
> [root@nightly62x-1 logs]# hdfs dfs -ls -R /dest/gcgdlknnasg/
> [root@nightly62x-1 logs]#
> hdfs dfs -put /etc/hosts /data/gcgdlknnasg/dir1/
> hdfs dfs -rm -r /data/gcgdlknnasg/dir1/
> hdfs dfs -mkdir /data/gcgdlknnasg/dir1/
> ===> Run BDR with “Abort on Snapshot Diff Failures” CHECKED now in the 
> replication schedule. You get the error below and the BDR job fails.
> 21/07/16 15:02:30 INFO distcp.DistCp: Failed to use snapshot diff - 
> java.io.FileNotFoundException: File does not exist: 
> /data/gcgdlknnasg/.snapshot/distcp-5-46485360-new/dir1/hosts
>       at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1494)
>       at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1487)
> ……..
> {code}
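
The description above says that create/modify entries under a directory already 
marked for deletion should be ignored when the final copy list is built. A rough 
sketch of that check follows, assuming paths are plain strings and using 
`java.nio.file.Path` as a stand-in for `org.apache.hadoop.fs.Path`. The class and 
method names are hypothetical and are not taken from the patch.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SkipUnderDeletedSketch {

  /** True if path equals, or lies beneath, any directory in deletedDirs. */
  static boolean isUnderAny(String path, List<String> deletedDirs) {
    Path p = Paths.get(path);
    for (String dir : deletedDirs) {
      // Path.startsWith compares whole name elements, so /data/dir10
      // is not treated as a child of /data/dir1.
      if (p.startsWith(Paths.get(dir))) {
        return true;
      }
    }
    return false;
  }

  /** Drops created/modified entries whose ancestor is marked deleted. */
  static List<String> filterCopyList(List<String> createdOrModified,
      List<String> deletedDirs) {
    List<String> kept = new ArrayList<>();
    for (String path : createdOrModified) {
      if (!isUnderAny(path, deletedDirs)) {
        kept.add(path);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<String> deleted = Arrays.asList("/data/gcgdlknnasg/dir1");
    List<String> entries = Arrays.asList(
        "/data/gcgdlknnasg/dir1/hosts",
        "/data/gcgdlknnasg/dir10/file",
        "/data/gcgdlknnasg/other");
    System.out.println(filterCopyList(entries, deleted));
  }
}
```

With such a check in place, the `dir1/hosts` entry from the reproduction steps 
would be dropped before any lookup in the newer snapshot is attempted.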



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
