[
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17682172#comment-17682172
]
ASF GitHub Bot commented on HADOOP-18596:
-----------------------------------------
steveloughran commented on code in PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#discussion_r1090843134
##########
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpConstants.java:
##########
@@ -142,6 +142,13 @@ private DistCpConstants() {
"distcp.blocks.per.chunk";
public static final String CONF_LABEL_USE_ITERATOR = "distcp.use.iterator";
+
+ /** Whether distcp -update should compare the modification times of the
+  * source and target files when deciding whether to skip a copy.
+  */
+ public static final String CONF_LABEL_UPDATE_MOD_TIME =
Review Comment:
going to need docs.
##########
hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/contract/AbstractContractDistCpTest.java:
##########
@@ -857,4 +862,83 @@ public void testDistCpWithUpdateExistFile() throws Exception {
verifyFileContents(localFS, dest, block);
}
+ @Test
+ public void testDistCpUpdateCheckFileSkip() throws Exception {
Review Comment:
I'm thinking of a way to test that 0 byte files don't get copied.
The testUpdateDeepDirectoryStructureNoChange() test shows how the counters
are used for validation; the new test should verify that the files are
skipped as well as checking their contents.
That should make it possible to verify that 0 byte files are always skipped,
something we can't do with content validation alone.
One thing to be aware of: this test suite isn't implemented for hdfs,
because the way it creates a new fs on every call is too slow. There should
be some specific hdfs-to-local test we cover too.
##########
hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/contract/AbstractContractDistCpTest.java:
##########
@@ -857,4 +862,83 @@ public void testDistCpWithUpdateExistFile() throws Exception {
verifyFileContents(localFS, dest, block);
}
+ @Test
+ public void testDistCpUpdateCheckFileSkip() throws Exception {
+ describe("Distcp update to check file skips.");
+
+ Path source = new Path(remoteDir, "file");
+ Path dest = new Path(localDir, "file");
+ dest = localFS.makeQualified(dest);
+
+ // Creating a source file with certain dataset.
+ byte[] sourceBlock = dataset(10, 'a', 'z');
+
+ // Write the dataset and as well create the target path.
+ try (FSDataOutputStream out = remoteFS.create(source)) {
Review Comment:
if you use `ContractTestUtils.writeDataset()` here the write is followed by
the check that the file is of the correct length; L882 just looks for existence
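For reference, the suggestion is to replace the bare create-and-write with a helper that also verifies the result. A minimal stdlib sketch of that write-then-verify pattern (the class and method names here are illustrative, not Hadoop's actual `ContractTestUtils` API):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustrative sketch of the write-then-verify pattern that
 *  ContractTestUtils.writeDataset() provides: write the bytes,
 *  then fail fast if the resulting file length does not match,
 *  rather than only checking the path exists. */
public class WriteVerify {
  static void writeDataset(Path file, byte[] data) throws IOException {
    Files.write(file, data);
    long len = Files.size(file);
    if (len != data.length) {
      throw new IOException("Expected " + data.length + " bytes at "
          + file + " but found " + len);
    }
  }

  public static void main(String[] args) throws IOException {
    Path tmp = Files.createTempFile("dataset", ".bin");
    writeDataset(tmp, new byte[]{'a', 'b', 'c'});
    System.out.println(Files.size(tmp)); // 3
    Files.delete(tmp);
  }
}
```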
> Distcp -update between different cloud stores to use modification time while
> checking for file skip.
> ----------------------------------------------------------------------------------------------------
>
> Key: HADOOP-18596
> URL: https://issues.apache.org/jira/browse/HADOOP-18596
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Reporter: Mehakmeet Singh
> Assignee: Mehakmeet Singh
> Priority: Major
> Labels: pull-request-available
>
> Distcp -update currently relies on file size, block size, and checksum
> comparisons to decide which files should be skipped or copied.
> Since different cloud stores use different checksum algorithms, we should
> add a modification time comparison to these checks.
> This ensures that during -update, files that appear to be out of sync are
> copied. The machines between which the file transfers occur should have
> synchronized clocks to avoid unnecessary copies.
> This also covers improving the testing and documentation of modification
> time checks between different object stores, to ensure files are not
> skipped incorrectly.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)