[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685982#comment-17685982 ]
ASF GitHub Bot commented on HADOOP-18596: ----------------------------------------- mehakmeet commented on code in PR #5308: URL: https://github.com/apache/hadoop/pull/5308#discussion_r1100421615 ########## hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyMapper.java: ########## @@ -361,31 +373,60 @@ private boolean canSkip(FileSystem sourceFS, CopyListingFileStatus source, if (sameLength && source.getLen() == 0) { return true; } - // if both the source and target have the same length, then check if the - // config to use modification time is set to true, then use the - // modification time and checksum comparison to determine if the copy can - // be skipped else if not set then just use the checksum comparison to - // check copy skip. + // If the src and target file have same size and block size, we would + // check if the checkCrc flag is enabled or not. If enabled, and the + // modTime comparison is enabled then return true if target file is older + // than the source file, since this indicates that the target file is + // recently updated and the source is not changed more recently than the + // update, we can skip the copy else we would copy. + // If skipCrc flag is disabled, we would check the checksum comparison + // which is an enum representing 3 values, of which if the comparison + // returns NOT_COMPATIBLE, we'll try to check modtime again, else return + // the result of checksum comparison which are compatible(true or false). // // Note: Different object stores can have different checksum algorithms // resulting in no checksum comparison that results in return true // always, having the modification time enabled can help in these // scenarios to not incorrectly skip a copy. Refer: HADOOP-18596. + if (sameLength && sameBlockSize) { - if (useModTimeToUpdate) { - return - (source.getModificationTime() < target.getModificationTime()) && - (skipCrc || DistCpUtils.checksumsAreEqual(sourceFS, - source.getPath(), null, - targetFS, target.getPath(), source.getLen())); + if (skipCrc) { + return maybeUseModTimeToCompare(source, target); } else { - return skipCrc || DistCpUtils + ChecksumComparison checksumComparison = DistCpUtils .checksumsAreEqual(sourceFS, source.getPath(), null, targetFS, target.getPath(), source.getLen()); + LOG.debug("Result of checksum comparison between src {} and target " + + "{} : {}", source, target, checksumComparison); + if (checksumComparison.equals(ChecksumComparison.INCOMPATIBLE)) { + return maybeUseModTimeToCompare(source, target); + } + // if skipCrc is disabled and checksumComparison is compatible we + // need not check the mod time. + return checksumComparison.equals(ChecksumComparison.TRUE); } - } else { - return false; } + return false; + } + + /** + * If the mod time comparison is enabled, check the mod time else return + * false. + * Comparison: If the target file perceives to have greater mod time(older) + * than the source file, we can assume that there has been no new changes + * that occurred in the source file, hence we should return true to skip the + * copy of the file. + * @param source Source fileStatus. + * @param target Target fileStatus. + * @return boolean representing result of modTime check. + */ + private boolean maybeUseModTimeToCompare( + CopyListingFileStatus source, FileStatus target) { + if (useModTimeToUpdate) { + return source.getModificationTime() < target.getModificationTime(); Review Comment: hmm, good point. just thinking if there would ever be a scenario when the source file is updated at the same time as it is synced to a different store, so we can have "=" to skip the copy... > Distcp -update between different cloud stores to use modification time while > checking for file skip. > ---------------------------------------------------------------------------------------------------- > > Key: HADOOP-18596 > URL: https://issues.apache.org/jira/browse/HADOOP-18596 > Project: Hadoop Common > Issue Type: Improvement > Components: tools/distcp > Reporter: Mehakmeet Singh > Assignee: Mehakmeet Singh > Priority: Major > Labels: pull-request-available > > Distcp -update currently relies on File size, block size, and Checksum > comparisons to figure out which files should be skipped or copied. > Since different cloud stores have different checksum algorithms we should > check for modification time as well to the checks. > This would ensure that while performing -update if the files are perceived to > be out of sync we should copy them. The machines between which the file > transfers occur should be in time sync to avoid any extra copies. > Improving testing and documentation for modification time checks between > different object stores to ensure no incorrect skipping of files. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org