[
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17684868#comment-17684868
]
ASF GitHub Bot commented on HADOOP-18596:
-----------------------------------------
steveloughran commented on code in PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#discussion_r1097777011
##########
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpConstants.java:
##########
@@ -142,6 +142,19 @@ private DistCpConstants() {
"distcp.blocks.per.chunk";
public static final String CONF_LABEL_USE_ITERATOR = "distcp.use.iterator";
+
+ /**
+ * Enabling distcp -update to use modification time of source and target
Review Comment:
nit: use {@code distcp -update} for better formatting.
##########
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyMapper.java:
##########
@@ -114,6 +115,8 @@ public void setup(Context context) throws IOException,
InterruptedException {
PRESERVE_STATUS.getConfigLabel()));
directWrite = conf.getBoolean(
DistCpOptionSwitch.DIRECT_WRITE.getConfigLabel(), false);
+ useModTimeToUpdate =
+ conf.getBoolean(DistCpConstants.CONF_LABEL_UPDATE_MOD_TIME, true);
Review Comment:
Refer to the proposed constant for the default value here, rather than the literal `true`.
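As a rough illustration of this suggestion (and of the matching comment on the field initializer), the default could live in one constant that both the field and the `getBoolean` call refer to. The sketch below is not the PR's code. The name `DEFAULT_UPDATE_MOD_TIME` and the classes `SketchConstants`, `SketchConf`, and `SketchMapper` are assumptions made so the example runs on its own; `SketchConf` is only a simplified stand-in for Hadoop's `Configuration`. The property key is the one named in the documentation hunk of this review.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical holder for the key and its default; not the real DistCpConstants. */
final class SketchConstants {
  /** Property key as named in the DistCp.md.vm hunk under review. */
  static final String CONF_LABEL_UPDATE_MOD_TIME = "distcp.update.modification.time";

  /** Assumed name for the proposed default-value constant. */
  static final boolean DEFAULT_UPDATE_MOD_TIME = true;

  private SketchConstants() {
  }
}

/** Simplified stand-in for org.apache.hadoop.conf.Configuration (assumption). */
final class SketchConf {
  private final Map<String, String> values = new HashMap<>();

  void set(String key, String value) {
    values.put(key, value);
  }

  /** Returns the stored value parsed as a boolean, or the default if the key is unset. */
  boolean getBoolean(String key, boolean defaultValue) {
    String raw = values.get(key);
    return raw == null ? defaultValue : Boolean.parseBoolean(raw.trim());
  }
}

/** Hypothetical mapper showing the field and setup() sharing one default. */
final class SketchMapper {
  // The field initializer and the lookup below use the same constant,
  // so the default cannot drift between the two places.
  private boolean useModTimeToUpdate = SketchConstants.DEFAULT_UPDATE_MOD_TIME;

  void setup(SketchConf conf) {
    useModTimeToUpdate = conf.getBoolean(
        SketchConstants.CONF_LABEL_UPDATE_MOD_TIME,
        SketchConstants.DEFAULT_UPDATE_MOD_TIME);
  }

  boolean usesModTime() {
    return useModTimeToUpdate;
  }
}
```

With this shape, changing the default later means touching a single line.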
##########
hadoop-tools/hadoop-distcp/src/site/markdown/DistCp.md.vm:
##########
@@ -631,14 +631,39 @@ hadoop distcp -update -numListstatusThreads 20 \
Because object stores are slow to list files, consider setting the
`-numListstatusThreads` option when performing a `-update` operation
on a large directory tree (the limit is 40 threads).
-When `DistCp -update` is used with object stores,
-generally only the modification time and length of the individual files are compared,
-not any checksums. The fact that most object stores do have valid timestamps
-for directories is irrelevant; only the file timestamps are compared.
-However, it is important to have the clock of the client computers close
-to that of the infrastructure, so that timestamps are consistent between
-the client/HDFS cluster and that of the object store. Otherwise, changed files may be
-missed/copied too often.
+When `DistCp -update` is used with object stores, generally only the
+modification time and length of the individual files are compared, not any
+checksums if the checksum algorithm between the two stores is different.
+
+* The `distcp -update` between two object stores with different checksum
+ algorithm compares the modification times of source and target files along
+ with the file size to determine whether to skip the file copy. The behavior
+ is controlled by the property `distcp.update.modification.time`, which is
+ set to true by default. If the source file is more recently modified than
+ the target file, it is assumed that the content has changed, and the file
+ should be updated.
+ We need to ensure that there is no clock skew between the machines.
+ The fact that most object stores do have valid timestamps for directories
+ is irrelevant; only the file timestamps are compared. However, it is
+ important to have the clock of the client computers close to that of the
+ infrastructure, so that timestamps are consistent between the client/HDFS
+ cluster and that of the object store. Otherwise, changed files may be
+ missed/copied too often.
+
+* `distcp.update.modification.time` can be used alongside the checksum check
+ in stores with same checksum algorithm as well. if set to true we check
+ both modification time and checksum between the files, but if this property
Review Comment:
OK, and the default option is "don't use checksums". I was wondering whether
we would want this turned on automatically when -skipCrc is set or the checksum
formats are incompatible.
But if we leave it as something to ask for explicitly, your code looks right.
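For readers following the documentation hunk above, the skip decision it describes for stores with different checksum algorithms might be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the PR: `FileMeta` and `ModTimeSkipSketch` are hypothetical names, the real `CopyMapper` works on Hadoop `FileStatus` objects, and the checksum comparison is left out entirely.

```java
/** Hypothetical, simplified file metadata; not Hadoop's FileStatus. */
final class FileMeta {
  final long length;
  final long modificationTime; // milliseconds since the epoch

  FileMeta(long length, long modificationTime) {
    this.length = length;
    this.modificationTime = modificationTime;
  }
}

/** Sketch of the length + modification time check described in the docs hunk. */
final class ModTimeSkipSketch {
  private ModTimeSkipSketch() {
  }

  /**
   * Returns true if the copy of this file can be skipped.
   * Assumes checksums cannot be compared, so they are ignored here.
   */
  static boolean canSkip(FileMeta source, FileMeta target,
      boolean useModTimeToUpdate) {
    if (source.length != target.length) {
      return false; // different sizes: always copy
    }
    if (!useModTimeToUpdate) {
      return true; // only the length is compared
    }
    // A source newer than the target is assumed to have changed content.
    return source.modificationTime <= target.modificationTime;
  }
}
```

The last comparison is where clock skew matters: if the clock that stamps target files runs behind the source's, unchanged files look newer at the source and are copied again; if it runs ahead, a genuinely changed file of the same length could be skipped.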
##########
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyMapper.java:
##########
@@ -85,6 +85,7 @@ static enum FileAction {
private boolean append = false;
private boolean verboseLog = false;
private boolean directWrite = false;
+ private boolean useModTimeToUpdate = true;
Review Comment:
Add a constant for the default value.
> Distcp -update between different cloud stores to use modification time while
> checking for file skip.
> ----------------------------------------------------------------------------------------------------
>
> Key: HADOOP-18596
> URL: https://issues.apache.org/jira/browse/HADOOP-18596
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Reporter: Mehakmeet Singh
> Assignee: Mehakmeet Singh
> Priority: Major
> Labels: pull-request-available
>
> Distcp -update currently relies on File size, block size, and Checksum
> comparisons to figure out which files should be skipped or copied.
> Since different cloud stores use different checksum algorithms, we should
> add a modification time check to these checks.
> This would ensure that, while performing -update, files that are perceived to
> be out of sync are copied. The machines between which the file
> transfers occur should be in time sync to avoid any extra copies.
> Testing and documentation for modification time checks between
> different object stores should be improved to ensure no incorrect skipping of files.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)