bshashikant commented on a change in pull request #3234:
URL: https://github.com/apache/hadoop/pull/3234#discussion_r676858532
##########
File path:
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpSync.java
##########
@@ -563,10 +589,27 @@ private Path translateRenamedPath(Path sourcePath,
} else {
List<DiffInfo> renameDiffsList =
diffMap.get(SnapshotDiffReport.DiffType.RENAME);
+ List<DiffInfo> deletedDirDiffsList =
+ diffMap.get(SnapshotDiffReport.DiffType.DELETE);
Review comment:
This list will hold every entry that is marked deleted. When a single
directory containing 1000 files is deleted, the snapshot diff report will
have just one deleted entry, for the directory itself.
The code already scans the whole rename list. With this change it has to scan
the deleted list as well, so it can cause a performance problem depending on
how many entries are marked deleted.
I have now modified the logic to build the deleted list only from entries that
are marked deleted because of a rename to an excluded path (excluded by the
filter). This should limit the scans.
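
To make the idea concrete, here is a minimal, self-contained sketch of building
such a restricted list. It is not the code in this PR: RenameEntry,
deletedByExcludedRename and isUnderDeletedDir are hypothetical names, paths are
plain strings rather than Path objects, and the filter is modelled as a simple
Predicate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Illustrative sketch only; none of these types exist in DistCpSync.
class ExcludedRenameSketch {

  /** Hypothetical stand-in for one rename entry of a snapshot diff. */
  static final class RenameEntry {
    final String source;
    final String target;

    RenameEntry(String source, String target) {
      this.source = source;
      this.target = target;
    }
  }

  /**
   * Collects the sources of renames whose target is excluded by the filter.
   * Only these are treated as deleted, so the resulting list stays small
   * compared to the full set of deleted entries in the diff.
   */
  static List<String> deletedByExcludedRename(List<RenameEntry> renames,
      Predicate<String> isExcluded) {
    List<String> deleted = new ArrayList<>();
    for (RenameEntry r : renames) {
      if (isExcluded.test(r.target)) {
        deleted.add(r.source);
      }
    }
    return deleted;
  }

  /** Returns true if path equals, or lies under, one of the listed dirs. */
  static boolean isUnderDeletedDir(String path, List<String> deletedDirs) {
    for (String dir : deletedDirs) {
      if (path.equals(dir) || path.startsWith(dir + "/")) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<RenameEntry> renames = Arrays.asList(
        new RenameEntry("/src/dir1", "/src/excluded/dir1"),
        new RenameEntry("/src/a", "/src/b"));
    // Assumed filter: anything under /src/excluded/ is skipped by the copy.
    List<String> deleted = deletedByExcludedRename(renames,
        target -> target.startsWith("/src/excluded/"));
    System.out.println(deleted);
    System.out.println(isUnderDeletedDir("/src/dir1/f1", deleted));
  }
}
```

The scan over the restricted list is then bounded by the number of renames into
excluded paths, not by the total number of deleted entries in the diff.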
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]