[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736275#comment-17736275 ]

Wei-Chiu Chuang commented on HADOOP-18596:

Yep. Agreed. I think the right fix is on the hbase-filesystem side to update the test.

> Distcp -update between different cloud stores to use modification time while checking for file skip.
>
>                 Key: HADOOP-18596
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18596
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>            Reporter: Mehakmeet Singh
>            Assignee: Mehakmeet Singh
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.3.6
>
> Distcp -update currently relies on file size, block size, and checksum comparisons to figure out which files should be skipped or copied.
> Since different cloud stores have different checksum algorithms, we should add modification time to the checks as well.
> This would ensure that, during -update, files perceived to be out of sync are copied. The machines between which the file transfers occur should be in time sync to avoid any extra copies.
> Improve testing and documentation for modification time checks between different object stores to ensure no incorrect skipping of files.

--
This message was sent by Atlassian Jira (v8.20.10#820010)

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
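The skip check described in the issue can be sketched roughly as follows. This is a minimal, self-contained model and not DistCp's actual code: the class `UpdateSkipSketch`, the `FileInfo` type, the `canSkip` method, and the `useModTime` flag are all hypothetical names, the block-size comparison is omitted for brevity, and the exact comparison (skip only when the target is strictly newer than the source) is an assumption made for illustration.

```java
/** Hypothetical model of the -update skip decision; not DistCp's real implementation. */
final class UpdateSkipSketch {

  /** Possible outcomes when comparing checksums across two stores. */
  enum ChecksumComparison { TRUE, FALSE, INCOMPATIBLE }

  /** Minimal stand-in for a file status (illustrative only). */
  static final class FileInfo {
    final long length;
    final long modificationTime; // milliseconds since the epoch

    FileInfo(long length, long modificationTime) {
      this.length = length;
      this.modificationTime = modificationTime;
    }
  }

  /**
   * Decide whether the copy of a file can be skipped.
   * Assumption: when checksums cannot be compared and the modification-time
   * check is enabled, skip only if the target is strictly newer than the source.
   */
  static boolean canSkip(FileInfo source, FileInfo target,
      ChecksumComparison checksums, boolean useModTime) {
    if (source.length != target.length) {
      return false; // different sizes: always copy
    }
    if (checksums == ChecksumComparison.TRUE) {
      return true; // comparable checksums that match: skip
    }
    if (checksums == ChecksumComparison.FALSE) {
      return false; // comparable checksums that differ: copy
    }
    // INCOMPATIBLE: the stores use different checksum algorithms.
    if (useModTime) {
      return source.modificationTime < target.modificationTime;
    }
    return true; // assumed fallback in this sketch: rely on the size match alone
  }

  public static void main(String[] args) {
    FileInfo older = new FileInfo(10, 1_000L);
    FileInfo newer = new FileInfo(10, 2_000L);
    // Target newer than source: skip.
    System.out.println(canSkip(older, newer, ChecksumComparison.INCOMPATIBLE, true));
    // Target older than source: copy.
    System.out.println(canSkip(newer, older, ChecksumComparison.INCOMPATIBLE, true));
  }
}
```

Because the decision compares timestamps taken on different systems, clock skew between them can turn a skip into an extra copy, which is why the description asks for the machines to be in time sync.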
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735771#comment-17735771 ]

Wei-Chiu Chuang commented on HADOOP-18596:

Hi [~mehakmeetSingh] in case you missed the Hadoop 3.3.6 vote thread in the Hadoop dev mailing lists, here's the excerpt:

[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 9.007 s <<< FAILURE! - in org.apache.hadoop.hbase.regionserver.TestSyncTimeRangeTracker
[ERROR] org.apache.hadoop.hbase.regionserver.TestSyncTimeRangeTracker.testConcurrentIncludeTimestampCorrectness  Time elapsed: 3.13 s  <<< ERROR!
java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.hbase.regionserver.TestSyncTimeRangeTracker$RandomTestData.<init>(TestSyncTimeRangeTracker.java:91)
    at org.apache.hadoop.hbase.regionserver.TestSyncTimeRangeTracker.testConcurrentIncludeTimestampCorrectness(TestSyncTimeRangeTracker.java:156)

bq. hbase-filesystem has three test failures in TestHBOSSContractDistCp, and is not reproducible with Hadoop 3.3.5.
bq. [ERROR] Failures:
bq. [ERROR] TestHBOSSContractDistCp>AbstractContractDistCpTest.testDistCpUpdateCheckFileSkip:976->Assert.fail:88 10 errors in file of length 10
bq. [ERROR] TestHBOSSContractDistCp>AbstractContractDistCpTest.testUpdateDeepDirectoryStructureNoChange:270->AbstractContractDistCpTest.assertCounterInRange:290->Assert.assertTrue:41->Assert.fail:88 Files Skipped value 0 too below minimum 1
bq. [ERROR] TestHBOSSContractDistCp>AbstractContractDistCpTest.testUpdateDeepDirectoryStructureToRemote:259->AbstractContractDistCpTest.distCpUpdateDeepDirectoryStructure:334->AbstractContractDistCpTest.assertCounterInRange:294->Assert.assertTrue:41->Assert.fail:88 Files Copied value 2 above maximum 1
bq. [INFO]
bq. [ERROR] Tests run: 240, Failures: 3, Errors: 0, Skipped: 58
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735579#comment-17735579 ]

Mehakmeet Singh commented on HADOOP-18596:

[~weichiu] Sorry, it seems this comment got lost in my emails. Can you please point to the failed hbase-filesystem test? Is it the same as https://issues.apache.org/jira/browse/HADOOP-18633, and is it still failing even after the fix?

In terms of the behavior, I believe we want this to be turned on by default, since it is required to handle incorrect file skips for distcp updates when checksums are not compatible.
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17733813#comment-17733813 ]

Wei-Chiu Chuang commented on HADOOP-18596:

This change and HADOOP-18633 caused an hbase-filesystem test to fail. Should this behavior be turned on by default, or should we fix the test in hbase-filesystem?
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688975#comment-17688975 ]

Mehakmeet Singh commented on HADOOP-18596:

[~ayushtkn] Thanks for pointing it out. I'll open a new Jira to address this; I think I see where the issue is.
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688959#comment-17688959 ]

Ayush Saxena commented on HADOOP-18596:

The newly introduced test seems flaky: it always fails for me locally, and it failed in the daily build as well. Can you check:
https://ci-hadoop.apache.org/view/Hadoop/job/hadoop-qbt-trunk-java8-linux-x86_64/1133/testReport/org.apache.hadoop.tools.contract/TestLocalContractDistCp/testDistCpUpdateCheckFileSkip/
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688400#comment-17688400 ]

ASF GitHub Bot commented on HADOOP-18596:

mehakmeet merged PR #5387:
URL: https://github.com/apache/hadoop/pull/5387
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687862#comment-17687862 ]

ASF GitHub Bot commented on HADOOP-18596:

hadoop-yetus commented on PR #5387:
URL: https://github.com/apache/hadoop/pull/5387#issuecomment-1427676401

:confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:--------:|:-------:|
| +0 :ok: | reexec | 10m 29s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. |
| | | | | _ branch-3.3 Compile Tests _ |
| +1 :green_heart: | mvninstall | 43m 4s | | branch-3.3 passed |
| +1 :green_heart: | compile | 0m 26s | | branch-3.3 passed |
| +1 :green_heart: | checkstyle | 0m 27s | | branch-3.3 passed |
| +1 :green_heart: | mvnsite | 0m 32s | | branch-3.3 passed |
| +1 :green_heart: | javadoc | 0m 30s | | branch-3.3 passed |
| +1 :green_heart: | spotbugs | 0m 55s | | branch-3.3 passed |
| +1 :green_heart: | shadedclient | 28m 23s | | branch has no errors when building and testing our client artifacts. |
| | | | | _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 34s | | the patch passed |
| +1 :green_heart: | compile | 0m 21s | | the patch passed |
| +1 :green_heart: | javac | 0m 21s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 14s | | the patch passed |
| +1 :green_heart: | mvnsite | 0m 23s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 16s | | the patch passed |
| +1 :green_heart: | spotbugs | 0m 48s | | the patch passed |
| +1 :green_heart: | shadedclient | 27m 58s | | patch has no errors when building and testing our client artifacts. |
| | | | | _ Other Tests _ |
| +1 :green_heart: | unit | 16m 3s | | hadoop-distcp in the patch passed. |
| +1 :green_heart: | asflicense | 0m 33s | | The patch does not generate ASF License warnings. |
| | | 132m 58s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5387/1/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/5387 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux a974ffad64bf 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | branch-3.3 / 399de84d4e44775c948729f62206321ef1081338 |
| Default Java | Private Build-1.8.0_352-8u352-ga-1~18.04-b08 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5387/1/testReport/ |
| Max. process+thread count | 618 (vs. ulimit of 5500) |
| modules | C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5387/1/console |
| versions | git=2.17.1 maven=3.6.0 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687783#comment-17687783 ]

ASF GitHub Bot commented on HADOOP-18596:

mehakmeet opened a new pull request, #5387:
URL: https://github.com/apache/hadoop/pull/5387

### Description of PR
Adds toggleable support for a modification time check during distcp -update between two stores whose checksums cannot be compared.

### How was this patch tested?
Compiled and ran the added tests on ABFS and S3A.

### For code changes:

- [ ] Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
- [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
- [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files?
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17686715#comment-17686715 ]

ASF GitHub Bot commented on HADOOP-18596:

steveloughran commented on PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#issuecomment-1424675867

ok, you can backport to 3.3, but not to the 3.3.5 branch
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17686626#comment-17686626 ]

ASF GitHub Bot commented on HADOOP-18596:

mehakmeet merged PR #5308:
URL: https://github.com/apache/hadoop/pull/5308
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17686437#comment-17686437 ]

ASF GitHub Bot commented on HADOOP-18596:

hadoop-yetus commented on PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#issuecomment-1424149799

:confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:--------:|:-------:|
| +0 :ok: | reexec | 0m 38s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. |
| | | | | _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 43m 24s | | trunk passed |
| +1 :green_heart: | compile | 0m 35s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | compile | 0m 31s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | checkstyle | 0m 34s | | trunk passed |
| +1 :green_heart: | mvnsite | 0m 37s | | trunk passed |
| +1 :green_heart: | javadoc | 0m 36s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | javadoc | 0m 30s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 0m 59s | | trunk passed |
| +1 :green_heart: | shadedclient | 23m 17s | | branch has no errors when building and testing our client artifacts. |
| | | | | _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 32s | | the patch passed |
| +1 :green_heart: | compile | 0m 26s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | javac | 0m 26s | | the patch passed |
| +1 :green_heart: | compile | 0m 23s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | javac | 0m 23s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 17s | | the patch passed |
| +1 :green_heart: | mvnsite | 0m 26s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 21s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | javadoc | 0m 19s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 0m 51s | | the patch passed |
| +1 :green_heart: | shadedclient | 23m 15s | | patch has no errors when building and testing our client artifacts. |
| | | | | _ Other Tests _ |
| +1 :green_heart: | unit | 14m 57s | | hadoop-distcp in the patch passed. |
| +1 :green_heart: | asflicense | 0m 38s | | The patch does not generate ASF License warnings. |
| | | 115m 32s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/9/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/5308 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 5f6ce632d6e0 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 58d8f84aa532f953da99ab8fe5bed9c28ea442f9 |
| Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/9/testReport/ |
| Max. process+thread count | 568 (vs. ulimit of 5500) |
| modules | C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/9/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17686365#comment-17686365 ]

ASF GitHub Bot commented on HADOOP-18596:

mehakmeet commented on code in PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#discussion_r1101287643

## hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/mapred/TestCopyCommitter.java: ##

@@ -562,11 +562,12 @@ private void testCommitWithChecksumMismatch(boolean skipCrc)
       Path sourcePath = new Path(sourceBase + srcFilename);
       CopyListingFileStatus sourceCurrStatus =
           new CopyListingFileStatus(fs.getFileStatus(sourcePath));
-      Assert.assertFalse(!DistCpUtils.checksumsAreEqual(
-          fs, new Path(sourceBase + srcFilename), null,
-          fs, new Path(targetBase + srcFilename),
-          sourceCurrStatus.getLen())
-          .equals(CopyMapper.ChecksumComparison.FALSE));
+      Assert.assertEquals("Checksum should not be equal",
+          DistCpUtils.checksumsAreEqual(
+              fs, new Path(sourceBase + srcFilename), null,
+              fs, new Path(targetBase + srcFilename),
+              sourceCurrStatus.getLen()),
+          CopyMapper.ChecksumComparison.FALSE);

Review Comment:
ooh, this is an old test. I changed the assertFalse to assertEquals but didn't realize the mistake I made. Thanks.
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17686362#comment-17686362 ]

ASF GitHub Bot commented on HADOOP-18596:
-----------------------------------------

steveloughran commented on code in PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#discussion_r1101279463

## hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/mapred/TestCopyCommitter.java:

@@ -562,11 +562,12 @@ private void testCommitWithChecksumMismatch(boolean skipCrc)
       Path sourcePath = new Path(sourceBase + srcFilename);
       CopyListingFileStatus sourceCurrStatus =
           new CopyListingFileStatus(fs.getFileStatus(sourcePath));
-      Assert.assertFalse(!DistCpUtils.checksumsAreEqual(
-          fs, new Path(sourceBase + srcFilename), null,
-          fs, new Path(targetBase + srcFilename),
-          sourceCurrStatus.getLen())
-          .equals(CopyMapper.ChecksumComparison.FALSE));
+      Assert.assertEquals("Checksum should not be equal",
+          DistCpUtils.checksumsAreEqual(
+              fs, new Path(sourceBase + srcFilename), null,
+              fs, new Path(targetBase + srcFilename),
+              sourceCurrStatus.getLen()),
+          CopyMapper.ChecksumComparison.FALSE);

Review Comment:
good test, but you need to put the expected value first, so that assertEquals prints the right "expected 1 actual 2" message. bit of PITA

> Distcp -update between different cloud stores to use modification time while
> checking for file skip.
>
>
> Key: HADOOP-18596
> URL: https://issues.apache.org/jira/browse/HADOOP-18596
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Reporter: Mehakmeet Singh
> Assignee: Mehakmeet Singh
> Priority: Major
> Labels: pull-request-available
>
> Distcp -update currently relies on File size, block size, and Checksum
> comparisons to figure out which files should be skipped or copied.
> Since different cloud stores have different checksum algorithms we should
> check for modification time as well to the checks.
> This would ensure that while performing -update if the files are perceived to
> be out of sync we should copy them. The machines between which the file
> transfers occur should be in time sync to avoid any extra copies.
> Improving testing and documentation for modification time checks between
> different object stores to ensure no incorrect skipping of files.
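The review point about argument order can be illustrated with a minimal sketch. It assumes the JUnit 4 convention of `assertEquals(message, expected, actual)`; the `assertEquals` helper and the class name `AssertOrderSketch` below are local stand-ins written only so the example runs without a test framework, and the `ChecksumComparison` enum is a reduced copy of the one named in the diff.

```java
import java.util.Objects;

// Stand-in for JUnit 4's assertEquals(message, expected, actual); hypothetical
// helper, written here only so the sketch runs without JUnit on the classpath.
final class AssertOrderSketch {
  // Reduced copy of the tri-state enum referenced in the diff.
  enum ChecksumComparison { TRUE, FALSE, INCOMPATIBLE }

  // Builds a failure message in the "expected ... but was ..." style.
  static String failureMessage(String message, Object expected, Object actual) {
    return message + " expected:<" + expected + "> but was:<" + actual + ">";
  }

  // Throws when the two values differ; the second argument is the expected one.
  static void assertEquals(String message, Object expected, Object actual) {
    if (!Objects.equals(expected, actual)) {
      throw new AssertionError(failureMessage(message, expected, actual));
    }
  }

  public static void main(String[] args) {
    // Pretend the code under test returned TRUE where FALSE was required.
    ChecksumComparison actual = ChecksumComparison.TRUE;
    try {
      // Expected value first, computed value second, so the message reads
      // correctly when the assertion fails.
      assertEquals("Checksum should not be equal",
          ChecksumComparison.FALSE, actual);
    } catch (AssertionError e) {
      System.out.println(e.getMessage());
    }
  }
}
```

With the arguments swapped, the failure text would report the computed value as "expected", which is the confusion the reviewer is warning about.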
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17686267#comment-17686267 ]

ASF GitHub Bot commented on HADOOP-18596:
-----------------------------------------

hadoop-yetus commented on PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#issuecomment-1423801752

:confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:--------:|:-------:|
| +0 :ok: | reexec | 0m 39s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 43m 23s | | trunk passed |
| +1 :green_heart: | compile | 0m 37s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | compile | 0m 31s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | checkstyle | 0m 33s | | trunk passed |
| +1 :green_heart: | mvnsite | 0m 36s | | trunk passed |
| +1 :green_heart: | javadoc | 0m 38s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | javadoc | 0m 29s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 0m 59s | | trunk passed |
| +1 :green_heart: | shadedclient | 23m 4s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 32s | | the patch passed |
| +1 :green_heart: | compile | 0m 25s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | javac | 0m 25s | | the patch passed |
| +1 :green_heart: | compile | 0m 23s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | javac | 0m 23s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 16s | | the patch passed |
| +1 :green_heart: | mvnsite | 0m 26s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 20s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | javadoc | 0m 20s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 0m 50s | | the patch passed |
| +1 :green_heart: | shadedclient | 23m 1s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 15m 17s | | hadoop-distcp in the patch passed. |
| +1 :green_heart: | asflicense | 0m 37s | | The patch does not generate ASF License warnings. |
| | | 115m 30s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/8/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/5308 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 51f2095ec10e 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 0f63b45b01ad69d7ccc810a52b22dbcfbab4c0cc |
| Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/8/testReport/ |
| Max. process+thread count | 768 (vs. ulimit of 5500) |
| modules | C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/8/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.

> Distcp -update between different cloud stores to use modification time while
> checking for
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17686220#comment-17686220 ]

ASF GitHub Bot commented on HADOOP-18596:
-----------------------------------------

mehakmeet commented on PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#issuecomment-1423704609

I have made the changes @steveloughran suggested, including changing ">" to ">=". I feel we could use either a strictly-greater or a greater-or-equal check. With the latter we take a slight risk: the source file may have changed at the same time the last sync took place, and we would skip the copy in that case. With the former we may do an additional copy even when no content has changed but the mod time is the same for both source and target. Shouldn't we prioritize accuracy here? Any more thoughts on whether we should change this or keep ">="?

> Distcp -update between different cloud stores to use modification time while
> checking for file skip.
>
>
> Key: HADOOP-18596
> URL: https://issues.apache.org/jira/browse/HADOOP-18596
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Reporter: Mehakmeet Singh
> Assignee: Mehakmeet Singh
> Priority: Major
> Labels: pull-request-available
>
> Distcp -update currently relies on File size, block size, and Checksum
> comparisons to figure out which files should be skipped or copied.
> Since different cloud stores have different checksum algorithms we should
> check for modification time as well to the checks.
> This would ensure that while performing -update if the files are perceived to
> be out of sync we should copy them. The machines between which the file
> transfers occur should be in time sync to avoid any extra copies.
> Improving testing and documentation for modification time checks between
> different object stores to ensure no incorrect skipping of files.
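The trade-off between the strict and the inclusive comparison discussed in this message shows up only when both timestamps are identical. The sketch below is illustrative, not the patch itself: the class and method names are hypothetical, and timestamps are plain `long` values standing in for `getModificationTime()` results.

```java
// Hypothetical illustration of the two candidate mod-time checks.
final class ModTimeBoundarySketch {
  // Skip only when the target is strictly newer than the source.
  // Equal timestamps cause a (possibly redundant) copy.
  static boolean canSkipStrict(long sourceModTime, long targetModTime) {
    return sourceModTime < targetModTime;
  }

  // Skip when the target is at least as new as the source.
  // Equal timestamps skip the copy, at the risk of missing a change
  // made to the source in the same instant as the last sync.
  static boolean canSkipInclusive(long sourceModTime, long targetModTime) {
    return sourceModTime <= targetModTime;
  }

  public static void main(String[] args) {
    long[][] cases = {{100L, 200L}, {100L, 100L}, {200L, 100L}};
    for (long[] c : cases) {
      System.out.println("source=" + c[0] + " target=" + c[1]
          + " strict=" + canSkipStrict(c[0], c[1])
          + " inclusive=" + canSkipInclusive(c[0], c[1]));
    }
  }
}
```

The two variants agree whenever the timestamps differ, so the choice only decides whether a tie is treated as "in sync" (fewer copies) or "possibly stale" (extra copy, no missed update).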
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685982#comment-17685982 ]

ASF GitHub Bot commented on HADOOP-18596:
-----------------------------------------

mehakmeet commented on code in PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#discussion_r1100421615

## hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyMapper.java:

@@ -361,31 +373,60 @@ private boolean canSkip(FileSystem sourceFS, CopyListingFileStatus source,
     if (sameLength && source.getLen() == 0) {
       return true;
     }
-    // if both the source and target have the same length, then check if the
-    // config to use modification time is set to true, then use the
-    // modification time and checksum comparison to determine if the copy can
-    // be skipped else if not set then just use the checksum comparison to
-    // check copy skip.
+    // If the src and target file have same size and block size, we would
+    // check if the checkCrc flag is enabled or not. If enabled, and the
+    // modTime comparison is enabled then return true if target file is older
+    // than the source file, since this indicates that the target file is
+    // recently updated and the source is not changed more recently than the
+    // update, we can skip the copy else we would copy.
+    // If skipCrc flag is disabled, we would check the checksum comparison
+    // which is an enum representing 3 values, of which if the comparison
+    // returns NOT_COMPATIBLE, we'll try to check modtime again, else return
+    // the result of checksum comparison which are compatible(true or false).
     //
     // Note: Different object stores can have different checksum algorithms
     // resulting in no checksum comparison that results in return true
     // always, having the modification time enabled can help in these
     // scenarios to not incorrectly skip a copy. Refer: HADOOP-18596.
+
     if (sameLength && sameBlockSize) {
-      if (useModTimeToUpdate) {
-        return
-            (source.getModificationTime() < target.getModificationTime()) &&
-            (skipCrc || DistCpUtils.checksumsAreEqual(sourceFS,
-                source.getPath(), null,
-                targetFS, target.getPath(), source.getLen()));
+      if (skipCrc) {
+        return maybeUseModTimeToCompare(source, target);
       } else {
-        return skipCrc || DistCpUtils
+        ChecksumComparison checksumComparison = DistCpUtils
             .checksumsAreEqual(sourceFS, source.getPath(), null,
                 targetFS, target.getPath(), source.getLen());
+        LOG.debug("Result of checksum comparison between src {} and target "
+            + "{} : {}", source, target, checksumComparison);
+        if (checksumComparison.equals(ChecksumComparison.INCOMPATIBLE)) {
+          return maybeUseModTimeToCompare(source, target);
+        }
+        // if skipCrc is disabled and checksumComparison is compatible we
+        // need not check the mod time.
+        return checksumComparison.equals(ChecksumComparison.TRUE);
       }
-    } else {
-      return false;
     }
+    return false;
+  }
+
+  /**
+   * If the mod time comparison is enabled, check the mod time else return
+   * false.
+   * Comparison: If the target file perceives to have greater mod time(older)
+   * than the source file, we can assume that there has been no new changes
+   * that occurred in the source file, hence we should return true to skip the
+   * copy of the file.
+   * @param source Source fileStatus.
+   * @param target Target fileStatus.
+   * @return boolean representing result of modTime check.
+   */
+  private boolean maybeUseModTimeToCompare(
+      CopyListingFileStatus source, FileStatus target) {
+    if (useModTimeToUpdate) {
+      return source.getModificationTime() < target.getModificationTime();

Review Comment:
hmm, good point. just thinking if there would ever be a scenario when the source file is updated at the same time as it is synced to a different store, so we can have "=" to skip the copy...

> Distcp -update between different cloud stores to use modification time while
> checking for file skip.
>
>
> Key: HADOOP-18596
> URL: https://issues.apache.org/jira/browse/HADOOP-18596
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Reporter: Mehakmeet Singh
> Assignee: Mehakmeet Singh
> Priority: Major
> Labels: pull-request-available
>
> Distcp -update currently relies on File size, block size, and Checksum
> comparisons to figure out which files should be skipped or copied.
> Since different cloud stores have different checksum algorithms we should
> check for
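The decision flow in the `canSkip` diff quoted in this message can be condensed into a self-contained sketch. This is a simplification under stated assumptions, not the Hadoop code: file statuses and the checksum call are replaced by plain parameters, the class name `CanSkipSketch` is hypothetical, and the mod-time helper returns false when the option is off, as the javadoc in the diff describes.

```java
// Simplified, hypothetical model of the skip decision shown in the diff.
final class CanSkipSketch {
  // Tri-state checksum outcome, as referenced in the diff.
  enum ChecksumComparison { TRUE, FALSE, INCOMPATIBLE }

  // Mod-time check: only meaningful when the option is enabled.
  static boolean maybeUseModTimeToCompare(boolean useModTimeToUpdate,
      long sourceModTime, long targetModTime) {
    return useModTimeToUpdate && sourceModTime < targetModTime;
  }

  static boolean canSkip(boolean sameLength, boolean sameBlockSize,
      long sourceLen, boolean skipCrc, ChecksumComparison checksum,
      boolean useModTimeToUpdate, long sourceModTime, long targetModTime) {
    // Two empty files are always considered in sync.
    if (sameLength && sourceLen == 0) {
      return true;
    }
    if (sameLength && sameBlockSize) {
      // No checksum available by request: fall back to the mod time.
      if (skipCrc) {
        return maybeUseModTimeToCompare(
            useModTimeToUpdate, sourceModTime, targetModTime);
      }
      // Checksums cannot be compared across stores: fall back to the mod time.
      if (checksum == ChecksumComparison.INCOMPATIBLE) {
        return maybeUseModTimeToCompare(
            useModTimeToUpdate, sourceModTime, targetModTime);
      }
      // Comparable checksums decide on their own.
      return checksum == ChecksumComparison.TRUE;
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(canSkip(true, true, 10L, false,
        ChecksumComparison.INCOMPATIBLE, true, 100L, 200L));
  }
}
```

The point of the tri-state enum is visible in the two fallback branches: an incomparable checksum no longer counts as a match, so the mod time gets the final say only when the checksum cannot.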
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685980#comment-17685980 ]

ASF GitHub Bot commented on HADOOP-18596:
-----------------------------------------

mehakmeet commented on code in PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#discussion_r1100412015

## hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/util/DistCpUtils.java:

@@ -613,8 +623,12 @@ public static void compareFileLengthsAndChecksums(long srcLen,
     //At this point, src & dest lengths are same. if length==0, we skip checksum
     if ((srcLen != 0) && (!skipCrc)) {
-      if (!checksumsAreEqual(sourceFS, source, sourceChecksum,
-          targetFS, target, srcLen)) {
+      CopyMapper.ChecksumComparison
+          checksumComparison = checksumsAreEqual(sourceFS, source, sourceChecksum,
+          targetFS, target, srcLen);
+      // If Checksum comparison is false set it to false, else set to true.
+      boolean checksumResult = !checksumComparison.equals(CopyMapper.ChecksumComparison.FALSE);

Review Comment:
We'll be setting "checksumResult" to be true for both "INCOMPATIBLE" and "TRUE" result from checksumsAreEqual() method else false and go through L632, so, we would be following the same flow as before since incompatible result from this method was true earlier too.

> Distcp -update between different cloud stores to use modification time while
> checking for file skip.
>
>
> Key: HADOOP-18596
> URL: https://issues.apache.org/jira/browse/HADOOP-18596
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Reporter: Mehakmeet Singh
> Assignee: Mehakmeet Singh
> Priority: Major
> Labels: pull-request-available
>
> Distcp -update currently relies on File size, block size, and Checksum
> comparisons to figure out which files should be skipped or copied.
> Since different cloud stores have different checksum algorithms we should
> check for modification time as well to the checks.
> This would ensure that while performing -update if the files are perceived to
> be out of sync we should copy them. The machines between which the file
> transfers occur should be in time sync to avoid any extra copies.
> Improving testing and documentation for modification time checks between
> different object stores to ensure no incorrect skipping of files.
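The mapping described in this review reply, where only an explicit `FALSE` counts as a mismatch, can be written out as a small sketch. The class name `ChecksumResultSketch` is hypothetical and the enum is a reduced copy of the one in the diff; the real method works on file systems and paths, which are omitted here.

```java
// Hypothetical reduction of the tri-state result to the earlier boolean flow.
final class ChecksumResultSketch {
  enum ChecksumComparison { TRUE, FALSE, INCOMPATIBLE }

  // TRUE and INCOMPATIBLE both map to true, matching the earlier behaviour
  // where incomparable checksums were reported as equal.
  static boolean checksumResult(ChecksumComparison comparison) {
    return comparison != ChecksumComparison.FALSE;
  }

  public static void main(String[] args) {
    for (ChecksumComparison c : ChecksumComparison.values()) {
      System.out.println(c + " -> " + checksumResult(c));
    }
  }
}
```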
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685976#comment-17685976 ]

ASF GitHub Bot commented on HADOOP-18596:
-----------------------------------------

steveloughran commented on code in PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#discussion_r1100395250

## hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/util/DistCpUtils.java:

@@ -613,8 +623,12 @@ public static void compareFileLengthsAndChecksums(long srcLen,
     //At this point, src & dest lengths are same. if length==0, we skip checksum
     if ((srcLen != 0) && (!skipCrc)) {
-      if (!checksumsAreEqual(sourceFS, source, sourceChecksum,
-          targetFS, target, srcLen)) {
+      CopyMapper.ChecksumComparison
+          checksumComparison = checksumsAreEqual(sourceFS, source, sourceChecksum,
+          targetFS, target, srcLen);
+      // If Checksum comparison is false set it to false, else set to true.
+      boolean checksumResult = !checksumComparison.equals(CopyMapper.ChecksumComparison.FALSE);

Review Comment:
is this outcome right. as L632 should be reached for any outcome other than True.

## hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyMapper.java:

@@ -361,31 +373,60 @@ private boolean canSkip(FileSystem sourceFS, CopyListingFileStatus source,
     if (sameLength && source.getLen() == 0) {
       return true;
     }
-    // if both the source and target have the same length, then check if the
-    // config to use modification time is set to true, then use the
-    // modification time and checksum comparison to determine if the copy can
-    // be skipped else if not set then just use the checksum comparison to
-    // check copy skip.
+    // If the src and target file have same size and block size, we would
+    // check if the checkCrc flag is enabled or not. If enabled, and the
+    // modTime comparison is enabled then return true if target file is older
+    // than the source file, since this indicates that the target file is
+    // recently updated and the source is not changed more recently than the
+    // update, we can skip the copy else we would copy.
+    // If skipCrc flag is disabled, we would check the checksum comparison
+    // which is an enum representing 3 values, of which if the comparison
+    // returns NOT_COMPATIBLE, we'll try to check modtime again, else return
+    // the result of checksum comparison which are compatible(true or false).
     //
     // Note: Different object stores can have different checksum algorithms
     // resulting in no checksum comparison that results in return true
     // always, having the modification time enabled can help in these
     // scenarios to not incorrectly skip a copy. Refer: HADOOP-18596.
+
     if (sameLength && sameBlockSize) {
-      if (useModTimeToUpdate) {
-        return
-            (source.getModificationTime() < target.getModificationTime()) &&
-            (skipCrc || DistCpUtils.checksumsAreEqual(sourceFS,
-                source.getPath(), null,
-                targetFS, target.getPath(), source.getLen()));
+      if (skipCrc) {
+        return maybeUseModTimeToCompare(source, target);
       } else {
-        return skipCrc || DistCpUtils
+        ChecksumComparison checksumComparison = DistCpUtils
             .checksumsAreEqual(sourceFS, source.getPath(), null,
                 targetFS, target.getPath(), source.getLen());
+        LOG.debug("Result of checksum comparison between src {} and target "
+            + "{} : {}", source, target, checksumComparison);
+        if (checksumComparison.equals(ChecksumComparison.INCOMPATIBLE)) {
+          return maybeUseModTimeToCompare(source, target);
+        }
+        // if skipCrc is disabled and checksumComparison is compatible we
+        // need not check the mod time.
+        return checksumComparison.equals(ChecksumComparison.TRUE);
       }
-    } else {
-      return false;
     }
+    return false;
+  }
+
+  /**
+   * If the mod time comparison is enabled, check the mod time else return
+   * false.
+   * Comparison: If the target file perceives to have greater mod time(older)
+   * than the source file, we can assume that there has been no new changes
+   * that occurred in the source file, hence we should return true to skip the
+   * copy of the file.
+   * @param source Source fileStatus.
+   * @param target Target fileStatus.
+   * @return boolean representing result of modTime check.
+   */
+  private boolean maybeUseModTimeToCompare(
+      CopyListingFileStatus source, FileStatus target) {
+    if (useModTimeToUpdate) {
+      return source.getModificationTime() < target.getModificationTime();

Review Comment:
should this be <= ?

## hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/mapred/TestCopyCommitter.java:

@@ -562,9
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685929#comment-17685929 ]

ASF GitHub Bot commented on HADOOP-18596:
-----------------------------------------

hadoop-yetus commented on PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#issuecomment-1422712877

:confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:--------:|:-------:|
| +0 :ok: | reexec | 0m 57s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 45m 47s | | trunk passed |
| +1 :green_heart: | compile | 0m 30s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | compile | 0m 26s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | checkstyle | 0m 27s | | trunk passed |
| +1 :green_heart: | mvnsite | 0m 32s | | trunk passed |
| +1 :green_heart: | javadoc | 0m 32s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | javadoc | 0m 24s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 0m 53s | | trunk passed |
| +1 :green_heart: | shadedclient | 26m 9s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 30s | | the patch passed |
| +1 :green_heart: | compile | 0m 24s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | javac | 0m 24s | | the patch passed |
| +1 :green_heart: | compile | 0m 20s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | javac | 0m 20s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 15s | | the patch passed |
| +1 :green_heart: | mvnsite | 0m 24s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 18s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | javadoc | 0m 18s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 0m 48s | | the patch passed |
| +1 :green_heart: | shadedclient | 25m 50s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 45m 52s | | hadoop-distcp in the patch passed. |
| +1 :green_heart: | asflicense | 0m 33s | | The patch does not generate ASF License warnings. |
| | | 153m 35s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/6/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/5308 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 84304b0f676c 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 4ff7f36138039a8ed90a0fd20af0d7b32f5a752e |
| Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/6/testReport/ |
| Max. process+thread count | 607 (vs. ulimit of 5500) |
| modules | C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/6/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.

> Distcp -update between different cloud stores to use modification time while
> checking for
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685926#comment-17685926 ]

ASF GitHub Bot commented on HADOOP-18596:
-----------------------------------------

hadoop-yetus commented on PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#issuecomment-1422701704

:confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:--------:|:-------:|
| +0 :ok: | reexec | 0m 57s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 43m 17s | | trunk passed |
| +1 :green_heart: | compile | 0m 34s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | compile | 0m 32s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | checkstyle | 0m 33s | | trunk passed |
| +1 :green_heart: | mvnsite | 0m 37s | | trunk passed |
| +1 :green_heart: | javadoc | 0m 36s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | javadoc | 0m 30s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 1m 0s | | trunk passed |
| +1 :green_heart: | shadedclient | 23m 33s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 33s | | the patch passed |
| +1 :green_heart: | compile | 0m 26s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | javac | 0m 26s | | the patch passed |
| +1 :green_heart: | compile | 0m 24s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | javac | 0m 24s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 16s | | the patch passed |
| +1 :green_heart: | mvnsite | 0m 27s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 20s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | javadoc | 0m 19s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 0m 50s | | the patch passed |
| +1 :green_heart: | shadedclient | 22m 58s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 38m 41s | | hadoop-distcp in the patch passed. |
| +1 :green_heart: | asflicense | 0m 38s | | The patch does not generate ASF License warnings. |
| | | 139m 35s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/7/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/5308 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 32cba67aa5ec 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / d64e6b66d4851b711c40dfaf0b9ffc2a8e5a24ac |
| Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/7/testReport/ |
| Max. process+thread count | 630 (vs. ulimit of 5500) |
| modules | C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/7/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.

> Distcp -update between different cloud stores to use modification time while
> checking for
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684868#comment-17684868 ]

ASF GitHub Bot commented on HADOOP-18596:
-----------------------------------------

steveloughran commented on code in PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#discussion_r109011

## hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpConstants.java:

@@ -142,6 +142,19 @@ private DistCpConstants() {
       "distcp.blocks.per.chunk";
   public static final String CONF_LABEL_USE_ITERATOR = "distcp.use.iterator";
+
+  /**
+   * Enabling distcp -update to use modification time of source and target

Review Comment:
nit, use {@code distcp -update} for the better formatting

## hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyMapper.java:

@@ -114,6 +115,8 @@ public void setup(Context context) throws IOException, InterruptedException {
         PRESERVE_STATUS.getConfigLabel()));
     directWrite = conf.getBoolean(
         DistCpOptionSwitch.DIRECT_WRITE.getConfigLabel(), false);
+    useModTimeToUpdate =
+        conf.getBoolean(DistCpConstants.CONF_LABEL_UPDATE_MOD_TIME, true);

Review Comment:
refer to that proposed constant for a default value

## hadoop-tools/hadoop-distcp/src/site/markdown/DistCp.md.vm:

@@ -631,14 +631,39 @@ hadoop distcp -update -numListstatusThreads 20 \
 Because object stores are slow to list files, consider setting the `-numListstatusThreads` option
 when performing a `-update` operation on a large directory tree (the limit is 40 threads).

-When `DistCp -update` is used with object stores,
-generally only the modification time and length of the individual files are compared,
-not any checksums. The fact that most object stores do have valid timestamps
-for directories is irrelevant; only the file timestamps are compared.
-However, it is important to have the clock of the client computers close
-to that of the infrastructure, so that timestamps are consistent between
-the client/HDFS cluster and that of the object store. Otherwise, changed files may be
-missed/copied too often.
+When `DistCp -update` is used with object stores, generally only the
+modification time and length of the individual files are compared, not any
+checksums if the checksum algorithm between the two stores is different.
+
+* The `distcp -update` between two object stores with different checksum
+  algorithm compares the modification times of source and target files along
+  with the file size to determine whether to skip the file copy. The behavior
+  is controlled by the property `distcp.update.modification.time`, which is
+  set to true by default. If the source file is more recently modified than
+  the target file, it is assumed that the content has changed, and the file
+  should be updated.
+  We need to ensure that there is no clock skew between the machines.
+  The fact that most object stores do have valid timestamps for directories
+  is irrelevant; only the file timestamps are compared. However, it is
+  important to have the clock of the client computers close to that of the
+  infrastructure, so that timestamps are consistent between the client/HDFS
+  cluster and that of the object store. Otherwise, changed files may be
+  missed/copied too often.
+
+* `distcp.update.modification.time` can be used alongside the checksum check
+  in stores with same checksum algorithm as well. if set to true we check
+  both modification time and checksum between the files, but if this property

Review Comment:
ok. and the default option is "don't use checksums". as i was thinking if we would want to have this on automatically if you are on -skipCrc or the formats are incompatible. but if we leave it something to explicitly ask for, your code looks right

## hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyMapper.java:

@@ -85,6 +85,7 @@ static enum FileAction {
   private boolean append = false;
   private boolean verboseLog = false;
   private boolean directWrite = false;
+  private boolean useModTimeToUpdate = true;

Review Comment:
add a constant for the default value

> Distcp -update between different cloud stores to use modification time while
> checking for file skip.
>
>
> Key: HADOOP-18596
> URL: https://issues.apache.org/jira/browse/HADOOP-18596
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Reporter: Mehakmeet Singh
> Assignee: Mehakmeet Singh
> Priority: Major
> Labels: pull-request-available
>
> Distcp -update currently relies on File size, block size, and Checksum
> comparisons to figure out which
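The two "constant for the default value" review comments in this message can be sketched as follows. Everything here is an assumption for illustration: `java.util.Properties` stands in for Hadoop's `Configuration`, the default-constant name `CONF_LABEL_UPDATE_MOD_TIME_DEFAULT` is hypothetical, and the key string is assumed to be the `distcp.update.modification.time` property named in the documentation diff.

```java
import java.util.Properties;

// Hypothetical sketch: one shared constant supplies the default in every place
// the option is read, instead of repeating the literal "true".
final class ModTimeConfigSketch {
  // Assumed to match the property named in the documentation diff.
  static final String CONF_LABEL_UPDATE_MOD_TIME =
      "distcp.update.modification.time";

  // Hypothetical constant name; the review only asks that such a constant exist.
  static final boolean CONF_LABEL_UPDATE_MOD_TIME_DEFAULT = true;

  // Properties is a stand-in for Hadoop's Configuration in this sketch.
  static boolean useModTimeToUpdate(Properties conf) {
    String raw = conf.getProperty(CONF_LABEL_UPDATE_MOD_TIME);
    return raw == null
        ? CONF_LABEL_UPDATE_MOD_TIME_DEFAULT
        : Boolean.parseBoolean(raw.trim());
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    System.out.println("unset: " + useModTimeToUpdate(conf));
    conf.setProperty(CONF_LABEL_UPDATE_MOD_TIME, "false");
    System.out.println("disabled: " + useModTimeToUpdate(conf));
  }
}
```

Keeping the default in one constant means the field initializer and the configuration lookup cannot drift apart if the default is ever changed.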
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1768#comment-1768 ] ASF GitHub Bot commented on HADOOP-18596: - mehakmeet commented on code in PR #5308: URL: https://github.com/apache/hadoop/pull/5308#discussion_r1096995207 ## hadoop-tools/hadoop-distcp/src/site/markdown/DistCp.md.vm: ## @@ -631,14 +631,39 @@ hadoop distcp -update -numListstatusThreads 20 \ Because object stores are slow to list files, consider setting the `-numListstatusThreads` option when performing a `-update` operation on a large directory tree (the limit is 40 threads). -When `DistCp -update` is used with object stores, -generally only the modification time and length of the individual files are compared, -not any checksums. The fact that most object stores do have valid timestamps -for directories is irrelevant; only the file timestamps are compared. -However, it is important to have the clock of the client computers close -to that of the infrastructure, so that timestamps are consistent between -the client/HDFS cluster and that of the object store. Otherwise, changed files may be -missed/copied too often. +When `DistCp -update` is used with object stores, generally only the +modification time and length of the individual files are compared, not any +checksums if the checksum algorithm between the two stores is different. + +* The `distcp -update` between two object stores with different checksum + algorithm compares the modification times of source and target files along + with the file size to determine whether to skip the file copy. The behavior + is controlled by the property `distcp.update.modification.time`, which is + set to true by default. If the source file is more recently modified than + the target file, it is assumed that the content has changed, and the file + should be updated. + We need to ensure that there is no clock skew between the machines. 
+ The fact that most object stores do have valid timestamps for directories + is irrelevant; only the file timestamps are compared. However, it is + important to have the clock of the client computers close to that of the + infrastructure, so that timestamps are consistent between the client/HDFS + cluster and that of the object store. Otherwise, changed files may be + missed/copied too often. + +* `distcp.update.modification.time` can be used alongside the checksum check + in stores with same checksum algorithm as well. if set to true we check + both modification time and checksum between the files, but if this property Review Comment: The timestamps are only used alongside checksums if we have set the config to true, else we would follow the default way that is offered today(So, we can switch off in cases where we know checksums would work). Since S3A/ABFS has checksums disabled we are returned null for the checksum value, we'll always see true for that case, but it can be true for cases where the checksums actually are identical too, so if we rely on checksum check to be true and then don't compare the timestamp, that can give false skips. So, should we check the timestamps inside of the checksum check instead? Like if the checksums for both source and target are not null and if we have the property set to true then do the mod time check? This would add few more changes as we would need to change the params inside different classes to pass the config value as well. We can always have the default value as false and use the property in the cases we want as well to keep the default way as the one offered today too. > Distcp -update between different cloud stores to use modification time while > checking for file skip. 
> > > Key: HADOOP-18596 > URL: https://issues.apache.org/jira/browse/HADOOP-18596 > Project: Hadoop Common > Issue Type: Improvement > Components: tools/distcp >Reporter: Mehakmeet Singh >Assignee: Mehakmeet Singh >Priority: Major > Labels: pull-request-available > > Distcp -update currently relies on File size, block size, and Checksum > comparisons to figure out which files should be skipped or copied. > Since different cloud stores have different checksum algorithms we should > check for modification time as well to the checks. > This would ensure that while performing -update if the files are perceived to > be out of sync we should copy them. The machines between which the file > transfers occur should be in time sync to avoid any extra copies. > Improving testing and documentation for modification time checks between > different object stores to ensure no incorrect skipping of files. -- This message was sent by Atlassian
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683527#comment-17683527 ] ASF GitHub Bot commented on HADOOP-18596: - steveloughran commented on code in PR #5308: URL: https://github.com/apache/hadoop/pull/5308#discussion_r1094941352 ## hadoop-tools/hadoop-distcp/src/site/markdown/DistCp.md.vm: ## @@ -631,14 +631,39 @@ hadoop distcp -update -numListstatusThreads 20 \ Because object stores are slow to list files, consider setting the `-numListstatusThreads` option when performing a `-update` operation on a large directory tree (the limit is 40 threads). -When `DistCp -update` is used with object stores, -generally only the modification time and length of the individual files are compared, -not any checksums. The fact that most object stores do have valid timestamps -for directories is irrelevant; only the file timestamps are compared. -However, it is important to have the clock of the client computers close -to that of the infrastructure, so that timestamps are consistent between -the client/HDFS cluster and that of the object store. Otherwise, changed files may be -missed/copied too often. +When `DistCp -update` is used with object stores, generally only the +modification time and length of the individual files are compared, not any +checksums if the checksum algorithm between the two stores is different. + +* The `distcp -update` between two object stores with different checksum + algorithm compares the modification times of source and target files along + with the file size to determine whether to skip the file copy. The behavior + is controlled by the property `distcp.update.modification.time`, which is + set to true by default. If the source file is more recently modified than + the target file, it is assumed that the content has changed, and the file + should be updated. + We need to ensure that there is no clock skew between the machines. 
+ The fact that most object stores do have valid timestamps for directories + is irrelevant; only the file timestamps are compared. However, it is + important to have the clock of the client computers close to that of the + infrastructure, so that timestamps are consistent between the client/HDFS + cluster and that of the object store. Otherwise, changed files may be + missed/copied too often. + +* `distcp.update.modification.time` can be used alongside the checksum check + in stores with same checksum algorithm as well. if set to true we check + both modification time and checksum between the files, but if this property Review Comment: really? I think if checksums are matching then timestamps shouldn't be compared at all. If two files' checksums match, that is sufficient to say "they are the same" > Distcp -update between different cloud stores to use modification time while > checking for file skip. > > > Key: HADOOP-18596 > URL: https://issues.apache.org/jira/browse/HADOOP-18596 > Project: Hadoop Common > Issue Type: Improvement > Components: tools/distcp >Reporter: Mehakmeet Singh >Assignee: Mehakmeet Singh >Priority: Major > Labels: pull-request-available > > Distcp -update currently relies on File size, block size, and Checksum > comparisons to figure out which files should be skipped or copied. > Since different cloud stores have different checksum algorithms we should > check for modification time as well to the checks. > This would ensure that while performing -update if the files are perceived to > be out of sync we should copy them. The machines between which the file > transfers occur should be in time sync to avoid any extra copies. > Improving testing and documentation for modification time checks between > different object stores to ensure no incorrect skipping of files. 
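The skip rule debated in the two review comments above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the code in PR #5308: `UpdateSkipSketch`, `FileInfo`, and `ChecksumResult` are hypothetical names, and a checksum is modeled as an algorithm name plus a string value that is null when the store returns no checksum (as described above for S3A/ABFS). The sketch assumes that matching checksums are sufficient to skip, that modification time is consulted only when the checksums cannot be compared and the option is on, and that zero-byte files of equal length are always skipped (suggested by the test comments in this thread, not confirmed here).

```java
import java.util.Objects;

/** Hypothetical sketch of a "can this copy be skipped?" decision for an -update style job. */
final class UpdateSkipSketch {

  /** Outcome of comparing two checksums (illustrative enum, not a Hadoop type). */
  enum ChecksumResult { MATCH, MISMATCH, INCOMPARABLE }

  /** Minimal stand-in for file metadata; the checksum fields may be null. */
  static final class FileInfo {
    final long length;
    final long modTime;              // milliseconds since the epoch
    final String checksumAlgorithm;  // null if the store exposes no checksum
    final String checksum;           // null if the store exposes no checksum

    FileInfo(long length, long modTime, String checksumAlgorithm, String checksum) {
      this.length = length;
      this.modTime = modTime;
      this.checksumAlgorithm = checksumAlgorithm;
      this.checksum = checksum;
    }
  }

  /** Checksums are comparable only if both exist and use the same algorithm. */
  static ChecksumResult compareChecksums(FileInfo source, FileInfo target) {
    if (source.checksum == null || target.checksum == null
        || !Objects.equals(source.checksumAlgorithm, target.checksumAlgorithm)) {
      return ChecksumResult.INCOMPARABLE;
    }
    return source.checksum.equals(target.checksum)
        ? ChecksumResult.MATCH
        : ChecksumResult.MISMATCH;
  }

  /** Returns true if the target is considered up to date and the copy can be skipped. */
  static boolean canSkip(FileInfo source, FileInfo target, boolean useModTime) {
    if (source.length != target.length) {
      return false;                  // different sizes: always copy
    }
    if (source.length == 0) {
      return true;                   // assumption: equal-length empty files are skipped
    }
    ChecksumResult result = compareChecksums(source, target);
    if (result == ChecksumResult.MATCH) {
      return true;                   // identical content, timestamps not needed
    }
    if (result == ChecksumResult.MISMATCH) {
      return false;                  // content differs
    }
    if (!useModTime) {
      return true;                   // length-only comparison
    }
    // A source newer than the target is assumed to have changed.
    return source.modTime <= target.modTime;
  }

  public static void main(String[] args) {
    FileInfo source = new FileInfo(10L, 2_000L, null, null);
    FileInfo staleTarget = new FileInfo(10L, 1_000L, null, null);
    System.out.println("skip with mod time check: " + canSkip(source, staleTarget, true));
    System.out.println("skip without mod time check: " + canSkip(source, staleTarget, false));
  }
}
```

With this ordering, clock skew only matters in the last branch, which is why the documentation draft asks for the client and store clocks to be close.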
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683457#comment-17683457 ]

ASF GitHub Bot commented on HADOOP-18596:

hadoop-yetus commented on PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#issuecomment-1413886297

:confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:--------:|:-------:|
| +0 :ok: | reexec | 0m 56s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 45m 38s | | trunk passed |
| +1 :green_heart: | compile | 0m 31s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | compile | 0m 26s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | checkstyle | 0m 28s | | trunk passed |
| +1 :green_heart: | mvnsite | 0m 31s | | trunk passed |
| +1 :green_heart: | javadoc | 0m 31s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | javadoc | 0m 24s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 0m 55s | | trunk passed |
| +1 :green_heart: | shadedclient | 25m 59s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 30s | | the patch passed |
| +1 :green_heart: | compile | 0m 24s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | javac | 0m 24s | | the patch passed |
| +1 :green_heart: | compile | 0m 22s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | javac | 0m 22s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 14s | | the patch passed |
| +1 :green_heart: | mvnsite | 0m 23s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 19s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | javadoc | 0m 17s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 0m 48s | | the patch passed |
| +1 :green_heart: | shadedclient | 26m 29s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 46m 10s | | hadoop-distcp in the patch passed. |
| +1 :green_heart: | asflicense | 0m 33s | | The patch does not generate ASF License warnings. |
| | | 154m 19s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/5/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/5308 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 2a9835850fa1 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 8c427bd8a37f76e61bf6c8d0d0e7cb364d3a9344 |
| Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/5/testReport/ |
| Max. process+thread count | 596 (vs. ulimit of 5500) |
| modules | C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/5/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.

> Distcp -update between different cloud stores to use modification time while
> checking for
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683260#comment-17683260 ] ASF GitHub Bot commented on HADOOP-18596: - mehakmeet commented on code in PR #5308: URL: https://github.com/apache/hadoop/pull/5308#discussion_r1094170778 ## hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/contract/AbstractContractDistCpTest.java: ## @@ -50,6 +50,7 @@ import org.apache.hadoop.tools.mapred.CopyMapper; import org.apache.hadoop.tools.util.DistCpTestUtils; import org.apache.hadoop.util.functional.RemoteIterators; +import org.apache.http.annotation.Contract; Review Comment: unused import. removing this and another one caught in the checkstyle test > Distcp -update between different cloud stores to use modification time while > checking for file skip. > > > Key: HADOOP-18596 > URL: https://issues.apache.org/jira/browse/HADOOP-18596 > Project: Hadoop Common > Issue Type: Improvement > Components: tools/distcp >Reporter: Mehakmeet Singh >Assignee: Mehakmeet Singh >Priority: Major > Labels: pull-request-available > > Distcp -update currently relies on File size, block size, and Checksum > comparisons to figure out which files should be skipped or copied. > Since different cloud stores have different checksum algorithms we should > check for modification time as well to the checks. > This would ensure that while performing -update if the files are perceived to > be out of sync we should copy them. The machines between which the file > transfers occur should be in time sync to avoid any extra copies. > Improving testing and documentation for modification time checks between > different object stores to ensure no incorrect skipping of files. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683091#comment-17683091 ] ASF GitHub Bot commented on HADOOP-18596: - steveloughran commented on code in PR #5308: URL: https://github.com/apache/hadoop/pull/5308#discussion_r1093469149 ## hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/contract/AbstractContractDistCpTest.java: ## @@ -868,35 +892,41 @@ public void testDistCpUpdateCheckFileSkip() throws Exception { Path source = new Path(remoteDir, "file"); Path dest = new Path(localDir, "file"); + +Path source_0byte = new Path(remoteDir, "file_0byte"); +Path dest_0byte = new Path(localDir, "file_0byte"); dest = localFS.makeQualified(dest); +dest_0byte = localFS.makeQualified(dest_0byte); // Creating a source file with certain dataset. byte[] sourceBlock = dataset(10, 'a', 'z'); // Write the dataset and as well create the target path. -try (FSDataOutputStream out = remoteFS.create(source)) { - out.write(sourceBlock); - localFS.create(dest); -} +ContractTestUtils.createFile(localFS, dest, true, sourceBlock); +ContractTestUtils +.writeDataset(remoteFS, source, sourceBlock, sourceBlock.length, +1024, true); -verifyPathExists(remoteFS, "", source); -verifyPathExists(localFS, "", dest); -DistCpTestUtils -.assertRunDistCp(DistCpConstants.SUCCESS, remoteDir.toString(), -localDir.toString(), "-delete -update" + getDefaultCLIOptions(), -conf); +// Create 0 byte source and target files. +ContractTestUtils.createFile(remoteFS, source_0byte, true, new byte[0]); +ContractTestUtils.createFile(localFS, dest_0byte, true, new byte[0]); + +// Execute the distcp -update job. +Job job = distCpUpdateWithFs(remoteDir, localDir, remoteFS, localFS); // First distcp -update would normally copy the source to dest. verifyFileContents(localFS, dest, sourceBlock); +// Verify 1 file was skipped in the distcp -update(They 0 byte files). 
Review Comment: nit: add space after update and replace They with `the` ## hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/contract/AbstractContractDistCpTest.java: ## @@ -50,6 +50,7 @@ import org.apache.hadoop.tools.mapred.CopyMapper; import org.apache.hadoop.tools.util.DistCpTestUtils; import org.apache.hadoop.util.functional.RemoteIterators; +import org.apache.http.annotation.Contract; Review Comment: where does this come from/get used? ## hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/contract/AbstractContractDistCpTest.java: ## @@ -913,32 +943,65 @@ public void testDistCpUpdateCheckFileSkip() throws Exception { long newTargetModTimeNew = modTimeSourceUpd + MODIFICATION_TIME_OFFSET; localFS.setTimes(dest, newTargetModTimeNew, -1); -DistCpTestUtils -.assertRunDistCp(DistCpConstants.SUCCESS, remoteDir.toString(), -localDir.toString(), "-delete -update" + getDefaultCLIOptions(), -conf); +// Execute the distcp -update job. +Job updatedSourceJobOldSrc = +distCpUpdateWithFs(remoteDir, localDir, remoteFS, +localFS); // File contents should remain same since the mod time for target is // newer than the updatedSource which indicates that the sync happened // more recently and there is no update. verifyFileContents(localFS, dest, sourceBlock); +// Skipped both 0 byte file and sourceFile(since mod time of target is +// older than the source it is perceived that source is of older version +// and we can skip it's copy). +verifySkipAndCopyCounter(updatedSourceJobOldSrc, 2, 0); // Subtract by an offset which would ensure enough gap for the test to // not fail due to race conditions. long newTargetModTimeOld = Math.min(modTimeSourceUpd - MODIFICATION_TIME_OFFSET, 0); localFS.setTimes(dest, newTargetModTimeOld, -1); -DistCpTestUtils -.assertRunDistCp(DistCpConstants.SUCCESS, remoteDir.toString(), -localDir.toString(), "-delete -update" + getDefaultCLIOptions(), -conf); +// Execute the distcp -update job. 
+Job updatedSourceJobNewSrc = distCpUpdateWithFs(remoteDir, localDir, +remoteFS, +localFS); -Assertions.assertThat(RemoteIterators.toList(localFS.listFiles(dest, true))) -.hasSize(1); +// Verifying the target directory have both 0 byte file and the content +// file. +Assertions +.assertThat(RemoteIterators.toList(localFS.listFiles(localDir, true))) +.hasSize(2); // Now the copy should take place and the file contents should change // since the mod time for target is older than the source file indicating // that there was an update to the
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683090#comment-17683090 ] ASF GitHub Bot commented on HADOOP-18596: - steveloughran commented on code in PR #5308: URL: https://github.com/apache/hadoop/pull/5308#discussion_r1093464483 ## hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/contract/AbstractContractDistCpTest.java: ## @@ -857,4 +862,83 @@ public void testDistCpWithUpdateExistFile() throws Exception { verifyFileContents(localFS, dest, block); } + @Test + public void testDistCpUpdateCheckFileSkip() throws Exception { Review Comment: oh, I didn't believe that was the case. I see the javadocs for it say we just skip the big file tests. i'm happy > Distcp -update between different cloud stores to use modification time while > checking for file skip. > > > Key: HADOOP-18596 > URL: https://issues.apache.org/jira/browse/HADOOP-18596 > Project: Hadoop Common > Issue Type: Improvement > Components: tools/distcp >Reporter: Mehakmeet Singh >Assignee: Mehakmeet Singh >Priority: Major > Labels: pull-request-available > > Distcp -update currently relies on File size, block size, and Checksum > comparisons to figure out which files should be skipped or copied. > Since different cloud stores have different checksum algorithms we should > check for modification time as well to the checks. > This would ensure that while performing -update if the files are perceived to > be out of sync we should copy them. The machines between which the file > transfers occur should be in time sync to avoid any extra copies. > Improving testing and documentation for modification time checks between > different object stores to ensure no incorrect skipping of files. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683087#comment-17683087 ] ASF GitHub Bot commented on HADOOP-18596: - steveloughran commented on code in PR #5308: URL: https://github.com/apache/hadoop/pull/5308#discussion_r1093465515 ## hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpConstants.java: ## @@ -142,6 +142,13 @@ private DistCpConstants() { "distcp.blocks.per.chunk"; public static final String CONF_LABEL_USE_ITERATOR = "distcp.use.iterator"; + + /** Distcp -update to use modification time of source and target file to + * check while skipping. + */ + public static final String CONF_LABEL_UPDATE_MOD_TIME = Review Comment: 1. javadoc to include {@value} 2. distcp docs to explain the option > Distcp -update between different cloud stores to use modification time while > checking for file skip. > > > Key: HADOOP-18596 > URL: https://issues.apache.org/jira/browse/HADOOP-18596 > Project: Hadoop Common > Issue Type: Improvement > Components: tools/distcp >Reporter: Mehakmeet Singh >Assignee: Mehakmeet Singh >Priority: Major > Labels: pull-request-available > > Distcp -update currently relies on File size, block size, and Checksum > comparisons to figure out which files should be skipped or copied. > Since different cloud stores have different checksum algorithms we should > check for modification time as well to the checks. > This would ensure that while performing -update if the files are perceived to > be out of sync we should copy them. The machines between which the file > transfers occur should be in time sync to avoid any extra copies. > Improving testing and documentation for modification time checks between > different object stores to ensure no incorrect skipping of files. 
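The two review requests in this thread, a `{@value}` tag in the javadoc and a named constant for the default, can be illustrated with a small self-contained sketch. The property name `distcp.update.modification.time` and the default of `true` come from the documentation draft quoted in this thread; the class `ModTimeOptionSketch`, the constant name `CONF_LABEL_UPDATE_MOD_TIME_DEFAULT`, and the use of `java.util.Properties` as a stand-in for Hadoop's configuration object are assumptions made for illustration only.

```java
import java.util.Properties;

/** Hypothetical sketch of a documented option key with an explicit default constant. */
final class ModTimeOptionSketch {

  /**
   * Whether -update compares source and target modification times
   * when deciding to skip a file.
   * Value: {@value}.
   */
  static final String CONF_LABEL_UPDATE_MOD_TIME =
      "distcp.update.modification.time";

  /**
   * Default for {@link #CONF_LABEL_UPDATE_MOD_TIME} (illustrative constant name).
   * Value: {@value}.
   */
  static final boolean CONF_LABEL_UPDATE_MOD_TIME_DEFAULT = true;

  private ModTimeOptionSketch() {
  }

  /** Reads the option from a Properties object, falling back to the default. */
  static boolean useModTimeToUpdate(Properties conf) {
    String raw = conf.getProperty(CONF_LABEL_UPDATE_MOD_TIME);
    return raw == null
        ? CONF_LABEL_UPDATE_MOD_TIME_DEFAULT
        : Boolean.parseBoolean(raw.trim());
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    System.out.println("default: " + useModTimeToUpdate(conf));
    conf.setProperty(CONF_LABEL_UPDATE_MOD_TIME, "false");
    System.out.println("after override: " + useModTimeToUpdate(conf));
  }
}
```

Keeping the default in a constant means a field initializer such as `useModTimeToUpdate = true` and the configuration lookup cannot drift apart, and `{@value}` makes the generated javadoc show the literal key and default.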
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683085#comment-17683085 ] ASF GitHub Bot commented on HADOOP-18596: - steveloughran commented on code in PR #5308: URL: https://github.com/apache/hadoop/pull/5308#discussion_r1093464483 ## hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/contract/AbstractContractDistCpTest.java: ## @@ -857,4 +862,83 @@ public void testDistCpWithUpdateExistFile() throws Exception { verifyFileContents(localFS, dest, block); } + @Test + public void testDistCpUpdateCheckFileSkip() throws Exception { Review Comment: oh, I didn't believe that was the case. I see the docs say we just skipped the big files there. If that test is happy then so am I. > Distcp -update between different cloud stores to use modification time while > checking for file skip. > > > Key: HADOOP-18596 > URL: https://issues.apache.org/jira/browse/HADOOP-18596 > Project: Hadoop Common > Issue Type: Improvement > Components: tools/distcp >Reporter: Mehakmeet Singh >Assignee: Mehakmeet Singh >Priority: Major > Labels: pull-request-available > > Distcp -update currently relies on File size, block size, and Checksum > comparisons to figure out which files should be skipped or copied. > Since different cloud stores have different checksum algorithms we should > check for modification time as well to the checks. > This would ensure that while performing -update if the files are perceived to > be out of sync we should copy them. The machines between which the file > transfers occur should be in time sync to avoid any extra copies. > Improving testing and documentation for modification time checks between > different object stores to ensure no incorrect skipping of files. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683009#comment-17683009 ]

ASF GitHub Bot commented on HADOOP-18596:

hadoop-yetus commented on PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#issuecomment-1412003628

:confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:--------:|:-------:|
| +0 :ok: | reexec | 17m 41s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 48m 51s | | trunk passed |
| +1 :green_heart: | compile | 0m 30s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | compile | 0m 26s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | checkstyle | 0m 26s | | trunk passed |
| +1 :green_heart: | mvnsite | 0m 31s | | trunk passed |
| +1 :green_heart: | javadoc | 0m 31s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | javadoc | 0m 24s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 0m 55s | | trunk passed |
| +1 :green_heart: | shadedclient | 25m 40s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 30s | | the patch passed |
| +1 :green_heart: | compile | 0m 24s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | javac | 0m 24s | | the patch passed |
| +1 :green_heart: | compile | 0m 20s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | javac | 0m 20s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 :warning: | checkstyle | 0m 14s | [/results-checkstyle-hadoop-tools_hadoop-distcp.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/4/artifact/out/results-checkstyle-hadoop-tools_hadoop-distcp.txt) | hadoop-tools/hadoop-distcp: The patch generated 4 new + 14 unchanged - 0 fixed = 18 total (was 14) |
| +1 :green_heart: | mvnsite | 0m 23s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 18s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | javadoc | 0m 16s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 0m 49s | | the patch passed |
| +1 :green_heart: | shadedclient | 26m 6s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 46m 9s | | hadoop-distcp in the patch passed. |
| +1 :green_heart: | asflicense | 0m 33s | | The patch does not generate ASF License warnings. |
| | | 173m 21s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/4/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/5308 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux e05c0db28e77 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / f094248e19b615310d4791e31d293b88ed6bde37 |
| Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/4/testReport/ |
| Max. process+thread count | 538 (vs. ulimit of 5500) |
| modules | C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/4/console |
| versions |
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17682933#comment-17682933 ] ASF GitHub Bot commented on HADOOP-18596: - mehakmeet commented on code in PR #5308: URL: https://github.com/apache/hadoop/pull/5308#discussion_r1092974460 ## hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/contract/AbstractContractDistCpTest.java: ## @@ -857,4 +862,83 @@ public void testDistCpWithUpdateExistFile() throws Exception { verifyFileContents(localFS, dest, block); } + @Test + public void testDistCpUpdateCheckFileSkip() throws Exception { Review Comment: There is `TestHDFSContractDistCp.java` that does implement this suite. I have run it before and it works successfully. Let me know if we still need to add another one specifically for hdfs. > Distcp -update between different cloud stores to use modification time while > checking for file skip. > > > Key: HADOOP-18596 > URL: https://issues.apache.org/jira/browse/HADOOP-18596 > Project: Hadoop Common > Issue Type: Improvement > Components: tools/distcp >Reporter: Mehakmeet Singh >Assignee: Mehakmeet Singh >Priority: Major > Labels: pull-request-available > > Distcp -update currently relies on File size, block size, and Checksum > comparisons to figure out which files should be skipped or copied. > Since different cloud stores have different checksum algorithms we should > check for modification time as well to the checks. > This would ensure that while performing -update if the files are perceived to > be out of sync we should copy them. The machines between which the file > transfers occur should be in time sync to avoid any extra copies. > Improving testing and documentation for modification time checks between > different object stores to ensure no incorrect skipping of files. 
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17682900#comment-17682900 ]

ASF GitHub Bot commented on HADOOP-18596:

mehakmeet commented on code in PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#discussion_r1092876062

## hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpConstants.java: ##
@@ -142,6 +142,13 @@ private DistCpConstants() {
       "distcp.blocks.per.chunk";
   public static final String CONF_LABEL_USE_ITERATOR = "distcp.use.iterator";
+
+  /** Distcp -update to use modification time of source and target file to
+   * check while skipping.
+   */
+  public static final String CONF_LABEL_UPDATE_MOD_TIME =

Review Comment:
Not sure whether you mean a change to the ".md" docs or to the Javadoc already added. I'll add some more info to the Javadoc for the constant.
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17682172#comment-17682172 ]

ASF GitHub Bot commented on HADOOP-18596:

steveloughran commented on code in PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#discussion_r1090843134

## hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpConstants.java: ##
@@ -142,6 +142,13 @@ private DistCpConstants() {
       "distcp.blocks.per.chunk";
   public static final String CONF_LABEL_USE_ITERATOR = "distcp.use.iterator";
+
+  /** Distcp -update to use modification time of source and target file to
+   * check while skipping.
+   */
+  public static final String CONF_LABEL_UPDATE_MOD_TIME =

Review Comment:
going to need docs.

## hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/contract/AbstractContractDistCpTest.java: ##
@@ -857,4 +862,83 @@ public void testDistCpWithUpdateExistFile() throws Exception {
     verifyFileContents(localFS, dest, block);
   }
+  @Test
+  public void testDistCpUpdateCheckFileSkip() throws Exception {

Review Comment:
I'm thinking of a way to test that 0 byte files don't get copied.

The testUpdateDeepDirectoryStructureNoChange() test shows how the counters are used for validation. The new test should validate that the files are skipped as well as checking the contents. That should be usable to verify that 0 byte files are always skipped, something we can't do with content validation.

One thing to be aware of is that this test suite isn't implemented by hdfs, because the way it creates a new fs on every call is too slow. There should be some specific hdfs to local test we should cover too.

## hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/contract/AbstractContractDistCpTest.java: ##
@@ -857,4 +862,83 @@ public void testDistCpWithUpdateExistFile() throws Exception {
     verifyFileContents(localFS, dest, block);
   }
+  @Test
+  public void testDistCpUpdateCheckFileSkip() throws Exception {
+    describe("Distcp update to check file skips.");
+
+    Path source = new Path(remoteDir, "file");
+    Path dest = new Path(localDir, "file");
+    dest = localFS.makeQualified(dest);
+
+    // Creating a source file with certain dataset.
+    byte[] sourceBlock = dataset(10, 'a', 'z');
+
+    // Write the dataset and as well create the target path.
+    try (FSDataOutputStream out = remoteFS.create(source)) {

Review Comment:
If you use `ContractTestUtils.writeDataset()` here, the write is followed by a check that the file is of the correct length; L882 just looks for existence.
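The counter-based validation suggested in the review above can be illustrated with a small, self-contained sketch. It is not the helper used in `AbstractContractDistCpTest`; the class name `CounterRangeSketch`, the method `checkInRange`, and the convention that a negative bound disables that side of the check are all assumptions made for illustration.

```java
/**
 * Hypothetical sketch of a counter range check, loosely modelled on how
 * the DistCp contract tests validate "files skipped" / "files copied"
 * counters. Names and conventions here are assumptions, not Hadoop APIs.
 */
final class CounterRangeSketch {

  private CounterRangeSketch() {
  }

  /**
   * Check that a counter value lies in [min, max].
   * A negative bound means "do not check this side" (an assumption).
   *
   * @return null if the value is acceptable, otherwise a description
   *         of the violation.
   */
  static String checkInRange(String name, long value, long min, long max) {
    if (min >= 0 && value < min) {
      return name + " value " + value + " below minimum " + min;
    }
    if (max >= 0 && value > max) {
      return name + " value " + value + " above maximum " + max;
    }
    return null;
  }
}
```

For the zero-byte case discussed above, a test would expect the skipped counter to be at least 1 and the copied counter to be at most 0 after a second `-update` run, which is a property that content validation alone cannot show.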
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17682021#comment-17682021 ] ASF GitHub Bot commented on HADOOP-18596: - hadoop-yetus commented on PR #5308: URL: https://github.com/apache/hadoop/pull/5308#issuecomment-1408457069 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 57s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 46m 19s | | trunk passed | | +1 :green_heart: | compile | 0m 31s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | compile | 0m 27s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | checkstyle | 0m 27s | | trunk passed | | +1 :green_heart: | mvnsite | 0m 31s | | trunk passed | | +1 :green_heart: | javadoc | 0m 31s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 0m 25s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 0m 56s | | trunk passed | | +1 :green_heart: | shadedclient | 26m 4s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 30s | | the patch passed | | +1 :green_heart: | compile | 0m 24s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javac | 0m 24s | | the patch passed | | +1 :green_heart: | compile | 0m 21s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | javac | 0m 21s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 14s | | the patch passed | | +1 :green_heart: | mvnsite | 0m 23s | | the patch passed | | +1 :green_heart: | javadoc | 0m 18s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 0m 17s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 0m 50s | | the patch passed | | +1 :green_heart: | shadedclient | 26m 4s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 46m 3s | | hadoop-distcp in the patch passed. | | +1 :green_heart: | asflicense | 0m 32s | | The patch does not generate ASF License warnings. 
| | | | 154m 24s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/3/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/5308 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 33f023938bd4 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / d23f13be729f0fe0770ce856b4f676f6782d83b9 | | Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/3/testReport/ | | Max. process+thread count | 540 (vs. ulimit of 5500) | | modules | C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp | | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/3/console | | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 | | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org | This message was automatically generated. > Distcp -update between different cloud stores to use modification time while > checking for
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17679756#comment-17679756 ] ASF GitHub Bot commented on HADOOP-18596: - hadoop-yetus commented on PR #5308: URL: https://github.com/apache/hadoop/pull/5308#issuecomment-1400145282 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 59s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 1s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 46m 28s | | trunk passed | | +1 :green_heart: | compile | 0m 31s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | compile | 0m 27s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | checkstyle | 0m 28s | | trunk passed | | +1 :green_heart: | mvnsite | 0m 32s | | trunk passed | | +1 :green_heart: | javadoc | 0m 32s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 0m 24s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 0m 57s | | trunk passed | | +1 :green_heart: | shadedclient | 26m 31s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 31s | | the patch passed | | +1 :green_heart: | compile | 0m 24s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javac | 0m 24s | | the patch passed | | +1 :green_heart: | compile | 0m 22s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | javac | 0m 22s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 15s | | the patch passed | | +1 :green_heart: | mvnsite | 0m 23s | | the patch passed | | +1 :green_heart: | javadoc | 0m 19s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 0m 17s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 0m 48s | | the patch passed | | +1 :green_heart: | shadedclient | 25m 58s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 45m 26s | | hadoop-distcp in the patch passed. | | +1 :green_heart: | asflicense | 0m 33s | | The patch does not generate ASF License warnings. 
| | | | 154m 21s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/2/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/5308 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 96f6b364be34 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / ee9a8568ae9e97cb94a05bbe1b2191811e0d45ee | | Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/2/testReport/ | | Max. process+thread count | 616 (vs. ulimit of 5500) | | modules | C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp | | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/2/console | | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 | | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org | This message was automatically generated. > Distcp -update between different cloud stores to use modification time while > checking for
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678783#comment-17678783 ]

ASF GitHub Bot commented on HADOOP-18596:

mehakmeet commented on code in PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#discussion_r1081471243

## hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyMapper.java: ##
@@ -354,7 +354,14 @@ private boolean canSkip(FileSystem sourceFS, CopyListingFileStatus source,
     boolean sameLength = target.getLen() == source.getLen();
     boolean sameBlockSize = source.getBlockSize() == target.getBlockSize()
         || !preserve.contains(FileAttribute.BLOCKSIZE);
-    if (sameLength && sameBlockSize) {
+    // checksum check to be done if same file len(greater than 0), same block
+    // size and the target file has been updated more recently than the source
+    // file.
+    // Note: For Different cloud stores with different checksum algorithms,
+    // checksum comparisons are not performed so we would be depending on the
+    // file size and modification time.
+    if (sameLength && (source.getLen() > 0) && sameBlockSize &&
+        source.getModificationTime() < target.getModificationTime()) {

Review Comment:
Ah, I actually had to add a check before this one so that a file of size 0 is skipped every time; I forgot to add it in this version locally. Good catch.
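The decision discussed in this review thread can be modelled in a few lines. The following is a minimal sketch under stated assumptions, not the code in `CopyMapper`: the `Status` class is a stand-in for a file status, the `checksumsComparable` and `checksumsEqual` flags replace the real checksum lookup, and the zero-byte shortcut reflects one reading of the reply above.

```java
/**
 * Hypothetical, self-contained model of the -update skip decision
 * discussed in this thread. It is not the Hadoop implementation.
 */
final class SkipDecisionSketch {

  private SkipDecisionSketch() {
  }

  /** Minimal stand-in for the file status fields the check reads. */
  static final class Status {
    final long length;
    final long blockSize;
    final long modTimeMillis;

    Status(long length, long blockSize, long modTimeMillis) {
      this.length = length;
      this.blockSize = blockSize;
      this.modTimeMillis = modTimeMillis;
    }
  }

  /**
   * @param preserveBlockSize   whether block size is being preserved
   * @param checksumsComparable false when the two stores use different
   *                            checksum algorithms (assumed input)
   * @param checksumsEqual      only consulted when checksums are comparable
   * @return true if the copy may be skipped
   */
  static boolean canSkip(Status source, Status target,
      boolean preserveBlockSize,
      boolean checksumsComparable, boolean checksumsEqual) {
    boolean sameLength = source.length == target.length;
    boolean sameBlockSize =
        source.blockSize == target.blockSize || !preserveBlockSize;

    // Zero-byte files of matching length carry no content to compare,
    // so they are skipped regardless of timestamps.
    if (sameLength && source.length == 0) {
      return true;
    }
    // Same size and block size, and the target was written after the
    // source was last modified: fall through to the checksum check.
    if (sameLength && sameBlockSize
        && source.modTimeMillis < target.modTimeMillis) {
      // With incomparable checksums, size and modification time decide.
      return !checksumsComparable || checksumsEqual;
    }
    return false;
  }
}
```

Under this model, a same-length source that is newer than the target is copied rather than skipped, which is the case the patch is meant to catch; a target that is newer than the source is still skipped when no checksum comparison is possible.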
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678780#comment-17678780 ]

Daniel Carl Jones commented on HADOOP-18596:

{quote}What Mehakmeet proposes is possible, doesn't add any risk of reduced copy (only increased copies) and fairly easy to test.
{quote}
So long as we meet this, i.e. we only potentially cause more files to be included in the update, then this change seems fine. Some users may find more files being copied than usual, but they are already exposed to the risk of newer same-length files not being copied when they should have been. Will communicating this bug fix in the change notes be enough?

{quote}We should look out that there shouldn't be a massive difference between the clocks so that the updation of the source files from one version to another should be more recent than the previous version being synced to cloud storage for example.
{quote}
Related to this: is there any way we can have DistCp abort the copy if it detects that the source and destination clocks have drifted beyond some acceptable threshold? Perhaps a separate Jira if it is a feasible check to add.
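The drift check floated above could look roughly like the sketch below. Nothing in this thread says DistCp has such a check; the class `ClockDriftSketch`, the idea of comparing a local timestamp with the modification time a store reports for a freshly written probe file, and the threshold parameter are all hypothetical.

```java
import java.time.Duration;

/**
 * Hypothetical sketch of a clock drift guard. It assumes the caller has
 * already obtained two timestamps in UTC milliseconds: the local clock
 * at the moment a probe file was written, and the modification time the
 * store reported for that file. Not part of DistCp.
 */
final class ClockDriftSketch {

  private ClockDriftSketch() {
  }

  /** Absolute difference between the two clocks. */
  static Duration drift(long localClockMillis, long storeModTimeMillis) {
    return Duration.ofMillis(Math.abs(localClockMillis - storeModTimeMillis));
  }

  /**
   * Abort if the observed drift exceeds the acceptable threshold.
   *
   * @throws IllegalStateException if drift is greater than maxDrift
   */
  static void checkDrift(long localClockMillis, long storeModTimeMillis,
      Duration maxDrift) {
    Duration observed = drift(localClockMillis, storeModTimeMillis);
    if (observed.compareTo(maxDrift) > 0) {
      throw new IllegalStateException("Clock drift of " + observed
          + " exceeds the allowed " + maxDrift);
    }
  }
}
```

A real check would also have to allow for write latency and for stores that round modification times, so any threshold would need to be generous.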
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678768#comment-17678768 ]

ASF GitHub Bot commented on HADOOP-18596:

dannycjones commented on code in PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#discussion_r1081452359

## hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyMapper.java: ##
@@ -354,7 +354,14 @@ private boolean canSkip(FileSystem sourceFS, CopyListingFileStatus source,
     boolean sameLength = target.getLen() == source.getLen();
     boolean sameBlockSize = source.getBlockSize() == target.getBlockSize()
         || !preserve.contains(FileAttribute.BLOCKSIZE);
-    if (sameLength && sameBlockSize) {
+    // checksum check to be done if same file len(greater than 0), same block
+    // size and the target file has been updated more recently than the source
+    // file.
+    // Note: For Different cloud stores with different checksum algorithms,
+    // checksum comparisons are not performed so we would be depending on the
+    // file size and modification time.
+    if (sameLength && (source.getLen() > 0) && sameBlockSize &&
+        source.getModificationTime() < target.getModificationTime()) {

Review Comment:
Why the addition of `getLen() > 0`? Do we want to always copy if it's an empty file?
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678223#comment-17678223 ] ASF GitHub Bot commented on HADOOP-18596: - hadoop-yetus commented on PR #5308: URL: https://github.com/apache/hadoop/pull/5308#issuecomment-1387042961 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 54s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 46m 21s | | trunk passed | | +1 :green_heart: | compile | 0m 30s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | compile | 0m 26s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | checkstyle | 0m 27s | | trunk passed | | +1 :green_heart: | mvnsite | 0m 30s | | trunk passed | | +1 :green_heart: | javadoc | 0m 31s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 0m 25s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 0m 57s | | trunk passed | | +1 :green_heart: | shadedclient | 25m 59s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 31s | | the patch passed | | +1 :green_heart: | compile | 0m 25s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javac | 0m 25s | | the patch passed | | +1 :green_heart: | compile | 0m 20s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | javac | 0m 20s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 0m 15s | [/results-checkstyle-hadoop-tools_hadoop-distcp.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/1/artifact/out/results-checkstyle-hadoop-tools_hadoop-distcp.txt) | hadoop-tools/hadoop-distcp: The patch generated 4 new + 9 unchanged - 0 fixed = 13 total (was 9) | | +1 :green_heart: | mvnsite | 0m 24s | | the patch passed | | +1 :green_heart: | javadoc | 0m 19s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 0m 17s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 0m 49s | | the patch passed | | +1 :green_heart: | shadedclient | 26m 15s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 44m 54s | | hadoop-distcp in the patch passed. | | +1 :green_heart: | asflicense | 0m 32s | | The patch does not generate ASF License warnings. 
| | | | 153m 18s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/1/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/5308 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux d9de88482ed7 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 5d5228db519f0cc615c4955ba36b9f3ee0572788 | | Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/1/testReport/ | | Max. process+thread count | 636 (vs. ulimit of 5500) | | modules | C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp | | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/1/console | | versions |
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678206#comment-17678206 ]

ASF GitHub Bot commented on HADOOP-18596:

mehakmeet commented on PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#issuecomment-1386942938

CC: @steveloughran @mukund-thakur
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678178#comment-17678178 ]

ASF GitHub Bot commented on HADOOP-18596:

mehakmeet opened a new pull request, #5308:
URL: https://github.com/apache/hadoop/pull/5308

### Description of PR
Use modification time as an additional check when deciding whether distcp -update should skip a file.
In specific cases, such as files with the same name and size but different content, we used to incorrectly skip files during an update, because no checksum comparison is performed between object stores that use different checksum algorithms. To mitigate this, we compare the modification time of the target file with that of the source.

### How was this patch tested?
Manually tested on an environment after reproducing the scenario where we might incorrectly skip a file.
Added a test in `AbstractContractDistCpTest.java` that changes the target file's modification time to emulate the scenario.
Tested on S3A (ap-south-1), ABFS (us-west-2), and LocalFS; the test was successful.

### For code changes:
- [X] Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
- [X] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
- [X] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [X] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files?
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677789#comment-17677789 ]

Steve Loughran commented on HADOOP-18596:
-----------------------------------------

Note we already state in the docs that modtime is used. It's just that the author of that paragraph (me) was misinformed.

Expecting clocks to be in sync is unrealistic. At least the modtime is configured to be UTC everywhere, so there's no time zone conversion, provided this requirement is met everywhere. I am confident the big cloud vendors get it right (our tests would probably have caught this by now), but private minio deployments may be misconfigured with both NTP and tz.

Mehakmeet's proposal will not cause copies which would not be skipped today to be skipped with the patch. What it will do is cause updates where the file length is the same to now be copied if source time > dest time. The worst case then is "not all updated files are detected".

Note that this will also address cross-EZ copies better, because there the hdfs cluster will be in 100% sync. Same for copies within the same s3/azure/gcs store, whether within the same fs uri or across containers/buckets/accounts.

The way to do this *properly* would be to log the checksum/etag of the source and update if that is different from the last upload. There'll be no need to check the destination at all, assuming the workflow is nothing but a chain of distcp++ jobs. Something like that would be a complete rewrite and I have no enthusiasm for that.
FWIW I have played with using spark for a distcp successor:
https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/com/cloudera/spark/cloud/applications/CloudCp.scala

If I was going to replace distcp I'd do it that way:
* modern execution env for dynamically passing work around
* rate throttling across job (allocate capacity to each worker process, share that across all active threads)
* good view of progress
* could provide an API to take any RDD as a source of the list of files to upload
* IOStats can be collected and marshalled back from workers to driver
* generate avro summary of the update which can then be converted into human reports.

I'm not going to go there. One challenge is actually recovering from failure of the job, as a complete restart would copy up all files for which the summary .avro file hasn't yet been generated. You'd actually want to commit the summary of each task attempt *in task commit* so that a new job would be able to pick it up and continue. mapreduce AM restart does this automatically, but not spark.

Then there's all the ideas from Apache Gobblin.

A distcp successor would be a massive undertaking and doesn't need to be in the hadoop modules. What Mehakmeet proposes is possible, doesn't add any risk of reduced copy (only increased copies) and is fairly easy to test.

> Distcp -update between different cloud stores to use modification time while
> checking for file skip.
>
> Key: HADOOP-18596
> URL: https://issues.apache.org/jira/browse/HADOOP-18596
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Reporter: Mehakmeet Singh
> Assignee: Mehakmeet Singh
> Priority: Major
>
> Distcp -update currently relies on File size, block size, and Checksum
> comparisons to figure out which files should be skipped or copied.
> Since different cloud stores have different checksum algorithms we should
> check for modification time as well to the checks.
> This would ensure that while performing -update if the files are perceived to
> be out of sync we should copy them. The machines between which the file
> transfers occur should be in time sync to avoid any extra copies.
> Improving testing and documentation for modification time checks between
> different object stores to ensure no incorrect skipping of files.
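The "proper" approach mentioned in the comment above, recording the checksum/etag of the source at upload time and comparing against it on the next run, might look roughly like the sketch below. Everything here is an assumption made for illustration: the class `UploadManifestSketch` and its methods are hypothetical, the record is held in memory rather than persisted, and nothing like this exists in DistCp.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: track the source etag seen at the last upload of each path. */
final class UploadManifestSketch {

    // path -> etag/checksum of the source at the time it was last uploaded
    private final Map<String, String> lastUploadedEtag = new HashMap<>();

    /** True when the path was never uploaded or its source etag has changed since. */
    boolean needsUpload(String path, String currentSourceEtag) {
        return !currentSourceEtag.equals(lastUploadedEtag.get(path));
    }

    /** Record the source etag after a successful upload. */
    void recordUpload(String path, String sourceEtag) {
        lastUploadedEtag.put(path, sourceEtag);
    }
}
```

Because only source-side values are compared, the destination store is never queried and clock synchronization stops mattering, at the cost of having to persist the record between jobs.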
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677654#comment-17677654 ]

Mehakmeet Singh commented on HADOOP-18596:
------------------------------------------

{quote}How would you ensure that they are in sync, two clocks *perfectly* in sync is kind of looks tough.
{quote}
Good question. Although I am not sure there's a way to have perfect time sync, I think we can use NTP (it is already widely used) to minimize clock drift. Cloud services like AWS already have their own internal time synchronization service, and if NTP is configured on the source machine, the two machines should be closely in sync. Although not perfect, this should be enough for distcp -update not to skip files incorrectly. We should make sure there isn't a massive difference between the clocks, so that an update to a source file, from one version to the next, carries a more recent timestamp than the previous version that was synced to cloud storage.
{quote}Do you plan to introduce an additional option for this or make it a default
{quote}
We are planning to have this by default, since it adds more resilience in cases where checksums cannot be compared between different object stores.

CC: [~ste...@apache.org] [~mthakur]

> Distcp -update between different cloud stores to use modification time while
> checking for file skip.
>
> Key: HADOOP-18596
> URL: https://issues.apache.org/jira/browse/HADOOP-18596
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Reporter: Mehakmeet Singh
> Assignee: Mehakmeet Singh
> Priority: Major
>
> Distcp -update currently relies on File size, block size, and Checksum
> comparisons to figure out which files should be skipped or copied.
> Since different cloud stores have different checksum algorithms we should
> check for modification time as well to the checks.
> This would ensure that while performing -update if the files are perceived to
> be out of sync we should copy them. The machines between which the file
> transfers occur should be in time sync to avoid any extra copies.
> Improving testing and documentation for modification time checks between
> different object stores to ensure no incorrect skipping of files.
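The clock-skew concern in the comment above can be made concrete with a small worked example. It assumes, purely for illustration, that the source file's timestamp comes from the source machine's clock, that the target's timestamp comes from the target store's clock, and that the target clock runs ahead by a fixed offset; the class and parameter names are hypothetical.

```java
/** Hypothetical worked example of how clock skew can hide an update. */
final class ClockSkewSketch {

    /**
     * Returns true when the update is detected, i.e. the source looks newer than the target.
     * Times are "true" wall-clock milliseconds; the target clock is ahead by targetClockAheadMillis.
     */
    static boolean updateDetected(long sourceEditMillis,
                                  long targetUploadMillis,
                                  long targetClockAheadMillis) {
        long sourceModTime = sourceEditMillis;                            // stamped by the source clock
        long targetModTime = targetUploadMillis + targetClockAheadMillis; // stamped by the skewed target clock
        return sourceModTime > targetModTime;
    }
}
```

For an upload at 1,000 ms and a source edit at 6,000 ms, a 2,000 ms skew still leaves the source looking newer, whereas a 10,000 ms skew makes the target look newer and the edit would be missed. In this model an update is only detected when the gap between the previous sync and the new edit exceeds the skew.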
[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.
[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677623#comment-17677623 ]

Ayush Saxena commented on HADOOP-18596:
---------------------------------------

Just bumped into it while checking my mail and have a question here:
{quote}The machines between which the file transfers occur should be in time sync to avoid any extra copies.
{quote}
How would you ensure that they are in sync? Keeping two clocks *perfectly* in sync looks kind of tough. An extra copy isn't an issue, it only hits performance; not copying and ending up with stale data is an issue.

Do you plan to introduce an additional option for this or make it a default?

> Distcp -update between different cloud stores to use modification time while
> checking for file skip.
>
> Key: HADOOP-18596
> URL: https://issues.apache.org/jira/browse/HADOOP-18596
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Reporter: Mehakmeet Singh
> Assignee: Mehakmeet Singh
> Priority: Major
>
> Distcp -update currently relies on File size, block size, and Checksum
> comparisons to figure out which files should be skipped or copied.
> Since different cloud stores have different checksum algorithms we should
> check for modification time as well to the checks.
> This would ensure that while performing -update if the files are perceived to
> be out of sync we should copy them. The machines between which the file
> transfers occur should be in time sync to avoid any extra copies.
> Improving testing and documentation for modification time checks between
> different object stores to ensure no incorrect skipping of files.