[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17684868#comment-17684868 ]
ASF GitHub Bot commented on HADOOP-18596:
-----------------------------------------

steveloughran commented on code in PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#discussion_r1097777011

##########
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpConstants.java:
##########
@@ -142,6 +142,19 @@ private DistCpConstants() {
       "distcp.blocks.per.chunk";
 
   public static final String CONF_LABEL_USE_ITERATOR = "distcp.use.iterator";
+
+  /**
+   * Enabling distcp -update to use modification time of source and target

Review Comment:
   nit, use {@code distcp -update} for the better formatting

##########
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyMapper.java:
##########
@@ -114,6 +115,8 @@ public void setup(Context context) throws IOException, InterruptedException {
             PRESERVE_STATUS.getConfigLabel()));
     directWrite = conf.getBoolean(
         DistCpOptionSwitch.DIRECT_WRITE.getConfigLabel(), false);
+    useModTimeToUpdate =
+        conf.getBoolean(DistCpConstants.CONF_LABEL_UPDATE_MOD_TIME, true);

Review Comment:
   refer to that proposed constant for a default value

##########
hadoop-tools/hadoop-distcp/src/site/markdown/DistCp.md.vm:
##########
@@ -631,14 +631,39 @@ hadoop distcp -update -numListstatusThreads 20 \
 Because object stores are slow to list files, consider setting the `-numListstatusThreads` option
 when performing a `-update` operation on a large directory tree (the limit is 40 threads).
 
-When `DistCp -update` is used with object stores,
-generally only the modification time and length of the individual files are compared,
-not any checksums. The fact that most object stores do have valid timestamps
-for directories is irrelevant; only the file timestamps are compared.
-However, it is important to have the clock of the client computers close
-to that of the infrastructure, so that timestamps are consistent between
-the client/HDFS cluster and that of the object store. Otherwise, changed files may be
-missed/copied too often.
+When `DistCp -update` is used with object stores, generally only the
+modification time and length of the individual files are compared, not any
+checksums if the checksum algorithm between the two stores is different.
+
+* The `distcp -update` between two object stores with different checksum
+  algorithm compares the modification times of source and target files along
+  with the file size to determine whether to skip the file copy. The behavior
+  is controlled by the property `distcp.update.modification.time`, which is
+  set to true by default. If the source file is more recently modified than
+  the target file, it is assumed that the content has changed, and the file
+  should be updated.
+  We need to ensure that there is no clock skew between the machines.
+  The fact that most object stores do have valid timestamps for directories
+  is irrelevant; only the file timestamps are compared. However, it is
+  important to have the clock of the client computers close to that of the
+  infrastructure, so that timestamps are consistent between the client/HDFS
+  cluster and that of the object store. Otherwise, changed files may be
+  missed/copied too often.
+
+* `distcp.update.modification.time` can be used alongside the checksum check
+  in stores with same checksum algorithm as well. if set to true we check
+  both modification time and checksum between the files, but if this property

Review Comment:
   ok. and the default option is "don't use checksums". as i was thinking if we would want to have this on automatically if you are on -skipCrc or the formats are incompatible.
   but if we leave it something to explicitly ask for, your code looks right

##########
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyMapper.java:
##########
@@ -85,6 +85,7 @@ static enum FileAction {
   private boolean append = false;
   private boolean verboseLog = false;
   private boolean directWrite = false;
+  private boolean useModTimeToUpdate = true;

Review Comment:
   add a constant for the default value

> Distcp -update between different cloud stores to use modification time while checking for file skip.
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-18596
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18596
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>            Reporter: Mehakmeet Singh
>            Assignee: Mehakmeet Singh
>            Priority: Major
>              Labels: pull-request-available
>
> Distcp -update currently relies on File size, block size, and Checksum
> comparisons to figure out which files should be skipped or copied.
> Since different cloud stores have different checksum algorithms we should
> check for modification time as well to the checks.
> This would ensure that while performing -update if the files are perceived to
> be out of sync we should copy them. The machines between which the file
> transfers occur should be in time sync to avoid any extra copies.
> Improving testing and documentation for modification time checks between
> different object stores to ensure no incorrect skipping of files.
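
For illustration only, here is a rough sketch of the skip decision described in the documentation diff and issue text above, together with the "constant for the default value" the review asks for. It is not the code in the PR: the class, the `FileInfo` stand-in, the `canSkip` and `useModTime` helpers, and the constant name `DEFAULT_UPDATE_MOD_TIME` are all hypothetical, and a plain `Map` stands in for the Hadoop `Configuration` so the example runs without Hadoop on the classpath. Only the property name `distcp.update.modification.time` and the label `CONF_LABEL_UPDATE_MOD_TIME` come from the diff.

```java
import java.util.HashMap;
import java.util.Map;

class UpdateSkipSketch {
  // Property name as given in the DistCp.md.vm diff.
  static final String CONF_LABEL_UPDATE_MOD_TIME = "distcp.update.modification.time";
  // Hypothetical name: a single constant for the default, as the review suggests.
  static final boolean DEFAULT_UPDATE_MOD_TIME = true;

  /** Hypothetical stand-in for a file status: length in bytes, mtime in millis. */
  static final class FileInfo {
    final long length;
    final long modificationTime;

    FileInfo(long length, long modificationTime) {
      this.length = length;
      this.modificationTime = modificationTime;
    }
  }

  /** Reads the flag, falling back to the shared default constant when unset. */
  static boolean useModTime(Map<String, String> conf) {
    String value = conf.get(CONF_LABEL_UPDATE_MOD_TIME);
    return value == null ? DEFAULT_UPDATE_MOD_TIME : Boolean.parseBoolean(value);
  }

  /**
   * Assumed rule when checksums cannot be compared: skip the copy only if the
   * lengths match and, when the flag is on, the source is not newer than the target.
   */
  static boolean canSkip(FileInfo source, FileInfo target, boolean useModTimeToUpdate) {
    if (source.length != target.length) {
      return false; // different size: always copy
    }
    if (useModTimeToUpdate && source.modificationTime > target.modificationTime) {
      return false; // source modified after target: assume content changed
    }
    return true;
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    FileInfo source = new FileInfo(1024L, 2_000L);
    FileInfo target = new FileInfo(1024L, 1_000L);
    System.out.println("skip copy: " + canSkip(source, target, useModTime(conf)));
  }
}
```

Both the field initialiser and the `conf.getBoolean(...)` call flagged in the review would then refer to the one default constant instead of repeating a literal `true`.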