mehakmeet commented on code in PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#discussion_r1096995207


##########
hadoop-tools/hadoop-distcp/src/site/markdown/DistCp.md.vm:
##########
@@ -631,14 +631,39 @@ hadoop distcp -update -numListstatusThreads 20  \
 Because object stores are slow to list files, consider setting the 
`-numListstatusThreads` option when performing a `-update` operation
 on a large directory tree (the limit is 40 threads).
 
-When `DistCp -update` is used with object stores,
-generally only the modification time and length of the individual files are compared,
-not any checksums. The fact that most object stores do have valid timestamps
-for directories is irrelevant; only the file timestamps are compared.
-However, it is important to have the clock of the client computers close
-to that of the infrastructure, so that timestamps are consistent between
-the client/HDFS cluster and that of the object store. Otherwise, changed files may be
-missed/copied too often.
+When `DistCp -update` is used with object stores, generally only the
+modification time and length of the individual files are compared, not any
+checksums if the checksum algorithm between the two stores is different.
+
+* The `distcp -update` between two object stores with different checksum
+  algorithm compares the modification times of source and target files along
+  with the file size to determine whether to skip the file copy. The behavior
+  is controlled by the property `distcp.update.modification.time`, which is
+  set to true by default. If the source file is more recently modified than
+  the target file, it is assumed that the content has changed, and the file
+  should be updated.
+  We need to ensure that there is no clock skew between the machines.
+  The fact that most object stores do have valid timestamps for directories
+  is irrelevant; only the file timestamps are compared. However, it is
+  important to have the clock of the client computers close to that of the
+  infrastructure, so that timestamps are consistent between the client/HDFS
+  cluster and that of the object store. Otherwise, changed files may be
+  missed/copied too often.
+
+* `distcp.update.modification.time` can be used alongside the checksum check
+  in stores with same checksum algorithm as well. if set to true we check
+  both modification time and checksum between the files, but if this property

Review Comment:
   The timestamps are used alongside checksums only if the config is set to
true; otherwise we follow the default behavior offered today (so we can switch
it off in cases where we know checksums would work).

   Since S3A/ABFS have checksums disabled, we get null back for the checksum
value, and the checksum check always returns true in that case. But it can also
return true when the checksums actually are identical, so if we rely on the
checksum check being true and then don't compare the timestamps, that can give
false skips.

   So, should we check the timestamps inside the checksum check instead? For
example, if the checksums for both source and target are not null and the
property is set to true, then do the mod time check? This would need a few more
changes, as we would have to change the params in different classes to pass
the config value as well.

   We could also default the property to false and enable it only in the cases
where we want it, so that the default stays the behavior offered today.
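
   To make the question concrete, here is a rough sketch of one possible reading of the proposal. It is not the actual DistCp code: `UpdateSkipSketch`, `FileInfo`, `canSkipCopy` and the `useModTime` flag are hypothetical names, with `useModTime` standing in for the value of `distcp.update.modification.time`, and a null `checksum` standing in for a store that returns no checksum.

```java
/** Hypothetical sketch of the skip decision discussed above; not DistCp's real code. */
final class UpdateSkipSketch {

    /** Minimal stand-in for the file attributes the check needs. */
    static final class FileInfo {
        final long length;
        final long modificationTime; // milliseconds since the epoch
        final String checksum;       // null when the store exposes no checksum

        FileInfo(long length, long modificationTime, String checksum) {
            this.length = length;
            this.modificationTime = modificationTime;
            this.checksum = checksum;
        }
    }

    /**
     * Returns true if the copy of source onto target can be skipped.
     * useModTime is an assumed stand-in for distcp.update.modification.time.
     */
    static boolean canSkipCopy(FileInfo source, FileInfo target, boolean useModTime) {
        // Different lengths always force a copy.
        if (source.length != target.length) {
            return false;
        }
        // Checksums are only comparable when both sides actually returned one.
        boolean comparable = source.checksum != null && target.checksum != null;
        if (comparable && !source.checksum.equals(target.checksum)) {
            return false;
        }
        // Either the checksums match or they could not be compared. With the
        // flag on, a source newer than the target is treated as changed.
        if (useModTime) {
            return source.modificationTime <= target.modificationTime;
        }
        // Flag off: keep today's behavior and trust length plus checksum result.
        return true;
    }
}
```

   With the flag off, a pair of files with null checksums and equal lengths is skipped as it is today; with the flag on, the same pair is copied whenever the source has the newer modification time.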
   


