[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17684868#comment-17684868
 ] 

ASF GitHub Bot commented on HADOOP-18596:
-----------------------------------------

steveloughran commented on code in PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#discussion_r1097777011


##########
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpConstants.java:
##########
@@ -142,6 +142,19 @@ private DistCpConstants() {
       "distcp.blocks.per.chunk";
 
   public static final String CONF_LABEL_USE_ITERATOR = "distcp.use.iterator";
+
+  /**
+   * Enabling distcp -update to use modification time of source and target

Review Comment:
   nit, use {@code distcp -update} for the better formatting



##########
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyMapper.java:
##########
@@ -114,6 +115,8 @@ public void setup(Context context) throws IOException, 
InterruptedException {
         PRESERVE_STATUS.getConfigLabel()));
     directWrite = conf.getBoolean(
         DistCpOptionSwitch.DIRECT_WRITE.getConfigLabel(), false);
+    useModTimeToUpdate =
+        conf.getBoolean(DistCpConstants.CONF_LABEL_UPDATE_MOD_TIME, true);

Review Comment:
   refer to that proposed constant for a default value



##########
hadoop-tools/hadoop-distcp/src/site/markdown/DistCp.md.vm:
##########
@@ -631,14 +631,39 @@ hadoop distcp -update -numListstatusThreads 20  \
 Because object stores are slow to list files, consider setting the 
`-numListstatusThreads` option when performing a `-update` operation
 on a large directory tree (the limit is 40 threads).
 
-When `DistCp -update` is used with object stores,
-generally only the modification time and length of the individual files are 
compared,
-not any checksums. The fact that most object stores do have valid timestamps
-for directories is irrelevant; only the file timestamps are compared.
-However, it is important to have the clock of the client computers close
-to that of the infrastructure, so that timestamps are consistent between
-the client/HDFS cluster and that of the object store. Otherwise, changed files 
may be
-missed/copied too often.
+When `DistCp -update` is used with object stores, generally only the
+modification time and length of the individual files are compared, not any
+checksums if the checksum algorithm between the two stores is different.
+
+* The `distcp -update` between two object stores with different checksum
+  algorithm compares the modification times of source and target files along
+  with the file size to determine whether to skip the file copy. The behavior
+  is controlled by the property `distcp.update.modification.time`, which is
+  set to true by default. If the source file is more recently modified than
+  the target file, it is assumed that the content has changed, and the file
+  should be updated.
+  We need to ensure that there is no clock skew between the machines.
+  The fact that most object stores do have valid timestamps for directories
+  is irrelevant; only the file timestamps are compared. However, it is
+  important to have the clock of the client computers close to that of the
+  infrastructure, so that timestamps are consistent between the client/HDFS
+  cluster and that of the object store. Otherwise, changed files may be
+  missed/copied too often.
+
+* `distcp.update.modification.time` can be used alongside the checksum check
+  in stores with same checksum algorithm as well. if set to true we check
+  both modification time and checksum between the files, but if this property

Review Comment:
   ok. and the default option is "don't use checksums". as i was thinking if we 
would want to have this on automatically if you are on -skipCrc or the formats 
are incompatible.
   
   but if we leave it something to explicitly ask for, your code looks right



##########
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyMapper.java:
##########
@@ -85,6 +85,7 @@ static enum FileAction {
   private boolean append = false;
   private boolean verboseLog = false;
   private boolean directWrite = false;
+  private boolean useModTimeToUpdate = true;

Review Comment:
   add a constant for the default value





> Distcp -update between different cloud stores to use modification time while 
> checking for file skip.
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-18596
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18596
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>            Reporter: Mehakmeet Singh
>            Assignee: Mehakmeet Singh
>            Priority: Major
>              Labels: pull-request-available
>
> Distcp -update currently relies on File size, block size, and Checksum 
> comparisons to figure out which files should be skipped or copied. 
> Since different cloud stores have different checksum algorithms we should 
> check for modification time as well to the checks.
> This would ensure that while performing -update if the files are perceived to 
> be out of sync we should copy them. The machines between which the file 
> transfers occur should be in time sync to avoid any extra copies.
> Improving testing and documentation for modification time checks between 
> different object stores to ensure no incorrect skipping of files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org

Reply via email to