mehakmeet commented on code in PR #5308: URL: https://github.com/apache/hadoop/pull/5308#discussion_r1096995207
##########
hadoop-tools/hadoop-distcp/src/site/markdown/DistCp.md.vm:
##########
@@ -631,14 +631,39 @@ hadoop distcp -update -numListstatusThreads 20 \
 Because object stores are slow to list files, consider setting the `-numListstatusThreads` option when performing a `-update` operation on a large directory tree (the limit is 40 threads).
 
-When `DistCp -update` is used with object stores,
-generally only the modification time and length of the individual files are compared,
-not any checksums. The fact that most object stores do have valid timestamps
-for directories is irrelevant; only the file timestamps are compared.
-However, it is important to have the clock of the client computers close
-to that of the infrastructure, so that timestamps are consistent between
-the client/HDFS cluster and that of the object store. Otherwise, changed files may be
-missed/copied too often.
+When `DistCp -update` is used with object stores, generally only the
+modification time and length of the individual files are compared, not any
+checksums if the checksum algorithm between the two stores is different.
+
+* The `distcp -update` between two object stores with different checksum
+  algorithm compares the modification times of source and target files along
+  with the file size to determine whether to skip the file copy. The behavior
+  is controlled by the property `distcp.update.modification.time`, which is
+  set to true by default. If the source file is more recently modified than
+  the target file, it is assumed that the content has changed, and the file
+  should be updated.
+  We need to ensure that there is no clock skew between the machines.
+  The fact that most object stores do have valid timestamps for directories
+  is irrelevant; only the file timestamps are compared. However, it is
+  important to have the clock of the client computers close to that of the
+  infrastructure, so that timestamps are consistent between the client/HDFS
+  cluster and that of the object store. Otherwise, changed files may be
+  missed/copied too often.
+
+* `distcp.update.modification.time` can be used alongside the checksum check
+  in stores with same checksum algorithm as well. if set to true we check
+  both modification time and checksum between the files, but if this property

Review Comment:
The timestamps are only used alongside checksums if we have set the config to true; otherwise we follow the default behaviour that is offered today (so we can switch it off in cases where we know checksums would work). Since S3A/ABFS have checksums disabled, we are returned null for the checksum value, so the checksum check always returns true in that case. But it can also return true when the checksums actually are identical, so if we rely on the checksum check being true and then don't compare the timestamps, that can give false skips.

So, should we check the timestamps inside the checksum check instead? That is, if the checksums for both source and target are not null and the property is set to true, then do the mod time check? This would add a few more changes, as we would need to change the params inside different classes to pass the config value as well. We can always have the default value as false and enable the property only in the cases where we want it, which keeps today's behaviour as the default too.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

For additional commands, e-mail: common-issues-h...@hadoop.apache.org
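
For reference, here is a rough sketch of the ordering proposed in the review comment above. It is only an illustration under stated assumptions: `UpdateSkipSketch`, `canSkip`, and all parameter names are hypothetical and are not DistCp's actual classes or signatures. Checksums are modelled as plain strings, with `null` standing in for a store that returns no checksum (as described for S3A/ABFS), and `useModTime` stands in for the value of `distcp.update.modification.time`.

```java
// Hypothetical sketch, not the real DistCp code: decides whether an
// -update copy of one file could be skipped.
final class UpdateSkipSketch {
    private UpdateSkipSketch() {
    }

    static boolean canSkip(long srcLen, long dstLen,
                           long srcModTime, long dstModTime,
                           String srcChecksum, String dstChecksum,
                           boolean useModTime) {
        // Different lengths always mean the file must be copied.
        if (srcLen != dstLen) {
            return false;
        }
        // Checksums are only comparable when both stores return one.
        boolean comparable = srcChecksum != null && dstChecksum != null;
        if (comparable && !srcChecksum.equals(dstChecksum)) {
            return false;
        }
        // Reaching here means the checksums matched or were unavailable.
        // With the property off, that alone permits the skip.
        if (!useModTime) {
            return true;
        }
        // With the property on, a source newer than the target is
        // assumed to have changed, so the copy is not skipped.
        return srcModTime <= dstModTime;
    }
}
```

With the flag passed into the check like this, a matching (or missing) checksum no longer short-circuits the timestamp comparison, which is the false-skip case raised in the comment. Passing `false` leaves the decision to length and checksum alone.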