[
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685982#comment-17685982
]
ASF GitHub Bot commented on HADOOP-18596:
-----------------------------------------
mehakmeet commented on code in PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#discussion_r1100421615
##########
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyMapper.java:
##########
@@ -361,31 +373,60 @@ private boolean canSkip(FileSystem sourceFS,
CopyListingFileStatus source,
if (sameLength && source.getLen() == 0) {
return true;
}
- // if both the source and target have the same length, then check if the
- // config to use modification time is set to true, then use the
- // modification time and checksum comparison to determine if the copy can
- // be skipped else if not set then just use the checksum comparison to
- // check copy skip.
+ // If the src and target file have same size and block size, we would
+ // check if the checkCrc flag is enabled or not. If enabled, and the
+ // modTime comparison is enabled then return true if target file is older
+ // than the source file, since this indicates that the target file is
+ // recently updated and the source is not changed more recently than the
+ // update, we can skip the copy else we would copy.
+ // If skipCrc flag is disabled, we would check the checksum comparison
+ // which is an enum representing 3 values, of which if the comparison
+ // returns NOT_COMPATIBLE, we'll try to check modtime again, else return
+ // the result of checksum comparison which are compatible(true or false).
//
// Note: Different object stores can have different checksum algorithms
// resulting in no checksum comparison that results in return true
// always, having the modification time enabled can help in these
// scenarios to not incorrectly skip a copy. Refer: HADOOP-18596.
+
if (sameLength && sameBlockSize) {
- if (useModTimeToUpdate) {
- return
- (source.getModificationTime() < target.getModificationTime()) &&
- (skipCrc || DistCpUtils.checksumsAreEqual(sourceFS,
- source.getPath(), null,
- targetFS, target.getPath(), source.getLen()));
+ if (skipCrc) {
+ return maybeUseModTimeToCompare(source, target);
} else {
- return skipCrc || DistCpUtils
+ ChecksumComparison checksumComparison = DistCpUtils
.checksumsAreEqual(sourceFS, source.getPath(), null,
targetFS, target.getPath(), source.getLen());
+ LOG.debug("Result of checksum comparison between src {} and target "
+ + "{} : {}", source, target, checksumComparison);
+ if (checksumComparison.equals(ChecksumComparison.INCOMPATIBLE)) {
+ return maybeUseModTimeToCompare(source, target);
+ }
+ // if skipCrc is disabled and checksumComparison is compatible we
+ // need not check the mod time.
+ return checksumComparison.equals(ChecksumComparison.TRUE);
}
- } else {
- return false;
}
+ return false;
+ }
+
+ /**
+ * If the mod time comparison is enabled, check the mod time else return
+ * false.
+ * Comparison: If the target file perceives to have greater mod time(older)
+ * than the source file, we can assume that there has been no new changes
+ * that occurred in the source file, hence we should return true to skip the
+ * copy of the file.
+ * @param source Source fileStatus.
+ * @param target Target fileStatus.
+ * @return boolean representing result of modTime check.
+ */
+ private boolean maybeUseModTimeToCompare(
+ CopyListingFileStatus source, FileStatus target) {
+ if (useModTimeToUpdate) {
+ return source.getModificationTime() < target.getModificationTime();
Review Comment:
hmm, good point.
just thinking if there would ever be a scenario when the source file is
updated at the same time as it is synced to a different store, so we can have
"=" to skip the copy...
> Distcp -update between different cloud stores to use modification time while
> checking for file skip.
> ----------------------------------------------------------------------------------------------------
>
> Key: HADOOP-18596
> URL: https://issues.apache.org/jira/browse/HADOOP-18596
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Reporter: Mehakmeet Singh
> Assignee: Mehakmeet Singh
> Priority: Major
> Labels: pull-request-available
>
> Distcp -update currently relies on File size, block size, and Checksum
> comparisons to figure out which files should be skipped or copied.
> Since different cloud stores have different checksum algorithms we should
> check for modification time as well to the checks.
> This would ensure that while performing -update if the files are perceived to
> be out of sync we should copy them. The machines between which the file
> transfers occur should be in time sync to avoid any extra copies.
> Improving testing and documentation for modification time checks between
> different object stores to ensure no incorrect skipping of files.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]