steveloughran commented on code in PR #5308: URL: https://github.com/apache/hadoop/pull/5308#discussion_r1097777011
########## hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpConstants.java: ########## @@ -142,6 +142,19 @@ private DistCpConstants() { "distcp.blocks.per.chunk"; public static final String CONF_LABEL_USE_ITERATOR = "distcp.use.iterator"; + + /** + * Enabling distcp -update to use modification time of source and target Review Comment: nit, use {@code distcp -update} for the better formatting ########## hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyMapper.java: ########## @@ -114,6 +115,8 @@ public void setup(Context context) throws IOException, InterruptedException { PRESERVE_STATUS.getConfigLabel())); directWrite = conf.getBoolean( DistCpOptionSwitch.DIRECT_WRITE.getConfigLabel(), false); + useModTimeToUpdate = + conf.getBoolean(DistCpConstants.CONF_LABEL_UPDATE_MOD_TIME, true); Review Comment: refer to that proposed constant for a default value ########## hadoop-tools/hadoop-distcp/src/site/markdown/DistCp.md.vm: ########## @@ -631,14 +631,39 @@ hadoop distcp -update -numListstatusThreads 20 \ Because object stores are slow to list files, consider setting the `-numListstatusThreads` option when performing a `-update` operation on a large directory tree (the limit is 40 threads). -When `DistCp -update` is used with object stores, -generally only the modification time and length of the individual files are compared, -not any checksums. The fact that most object stores do have valid timestamps -for directories is irrelevant; only the file timestamps are compared. -However, it is important to have the clock of the client computers close -to that of the infrastructure, so that timestamps are consistent between -the client/HDFS cluster and that of the object store. Otherwise, changed files may be -missed/copied too often. +When `DistCp -update` is used with object stores, generally only the +modification time and length of the individual files are compared, not any +checksums if the checksum algorithm between the two stores is different. + +* The `distcp -update` between two object stores with different checksum + algorithm compares the modification times of source and target files along + with the file size to determine whether to skip the file copy. The behavior + is controlled by the property `distcp.update.modification.time`, which is + set to true by default. If the source file is more recently modified than + the target file, it is assumed that the content has changed, and the file + should be updated. + We need to ensure that there is no clock skew between the machines. + The fact that most object stores do have valid timestamps for directories + is irrelevant; only the file timestamps are compared. However, it is + important to have the clock of the client computers close to that of the + infrastructure, so that timestamps are consistent between the client/HDFS + cluster and that of the object store. Otherwise, changed files may be + missed/copied too often. + +* `distcp.update.modification.time` can be used alongside the checksum check + in stores with same checksum algorithm as well. if set to true we check + both modification time and checksum between the files, but if this property Review Comment: ok. and the default option is "don't use checksums". as i was thinking if we would want to have this on automatically if you are on -skipCrc or the formats are incompatible. but if we leave it something to explicitly ask for, your code looks right ########## hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyMapper.java: ########## @@ -85,6 +85,7 @@ static enum FileAction { private boolean append = false; private boolean verboseLog = false; private boolean directWrite = false; + private boolean useModTimeToUpdate = true; Review Comment: add a constant for the default value -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org