[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-06-22 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736275#comment-17736275
 ] 

Wei-Chiu Chuang commented on HADOOP-18596:
--

Yep. Agreed. I think the right fix is to update the test on the 
hbase-filesystem side.

> Distcp -update between different cloud stores to use modification time while
> checking for file skip.
>
>          Key: HADOOP-18596
>          URL: https://issues.apache.org/jira/browse/HADOOP-18596
>      Project: Hadoop Common
>   Issue Type: Improvement
>   Components: tools/distcp
>     Reporter: Mehakmeet Singh
>     Assignee: Mehakmeet Singh
>     Priority: Major
>       Labels: pull-request-available
>      Fix For: 3.3.6
>
> Distcp -update currently relies on file size, block size, and checksum
> comparisons to decide which files should be skipped or copied.
> Since different cloud stores use different checksum algorithms, we should
> add modification time to the checks as well.
> This would ensure that, while performing -update, files perceived to be
> out of sync are copied. The machines between which the file transfers
> occur should be in time sync to avoid any extra copies.
> Improve testing and documentation for modification time checks between
> different object stores to ensure files are not skipped incorrectly.
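
The skip decision described in the issue can be sketched roughly as follows. This is only an illustration of one reading of the description, not the code merged for HADOOP-18596: the class, the `FileInfo` holder, the `canSkip` signature and the `useModTime` flag are all hypothetical names, and the enum is a local stand-in for `CopyMapper.ChecksumComparison` (only its `FALSE` value appears in this thread; `TRUE` and `INCOMPATIBLE` are assumed). The sketch also assumes that, with the modification time check disabled, incompatible checksums lead to a skip once sizes match, and that a target at least as new as the source counts as up to date.

```java
// Hypothetical sketch of a "-update" skip decision; not the actual DistCp code.
class UpdateSkipSketch {

    // Local stand-in for the outcome of a checksum comparison.
    // INCOMPATIBLE models two stores whose checksum algorithms cannot be compared.
    enum ChecksumComparison { TRUE, FALSE, INCOMPATIBLE }

    // Minimal file metadata needed for the decision (assumed fields).
    static final class FileInfo {
        final long length;
        final long blockSize;
        final long modificationTime; // milliseconds since the epoch

        FileInfo(long length, long blockSize, long modificationTime) {
            this.length = length;
            this.blockSize = blockSize;
            this.modificationTime = modificationTime;
        }
    }

    // Returns true when the copy of this file can be skipped.
    static boolean canSkip(FileInfo source, FileInfo target,
                           ChecksumComparison checksums, boolean useModTime) {
        // Different size or block size: the files are out of sync, so copy.
        if (source.length != target.length || source.blockSize != target.blockSize) {
            return false;
        }
        switch (checksums) {
            case TRUE:
                return true;   // comparable checksums that match
            case FALSE:
                return false;  // comparable checksums that differ
            default:
                // Checksums cannot be compared. With the check enabled, skip only
                // if the target is at least as new as the source (assumption).
                return !useModTime
                        || source.modificationTime <= target.modificationTime;
        }
    }

    public static void main(String[] args) {
        FileInfo source = new FileInfo(10, 128, 2000);
        FileInfo olderTarget = new FileInfo(10, 128, 1000);
        FileInfo newerTarget = new FileInfo(10, 128, 3000);

        System.out.println("older target, check on:  skip="
                + canSkip(source, olderTarget, ChecksumComparison.INCOMPATIBLE, true));
        System.out.println("newer target, check on:  skip="
                + canSkip(source, newerTarget, ChecksumComparison.INCOMPATIBLE, true));
        System.out.println("older target, check off: skip="
                + canSkip(source, olderTarget, ChecksumComparison.INCOMPATIBLE, false));
    }
}
```

Because the fallback compares timestamps taken on two different systems, clock skew between them can turn a skip into an extra copy, which is why the description asks for the machines to be in time sync.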



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-06-21 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735771#comment-17735771
 ] 

Wei-Chiu Chuang commented on HADOOP-18596:
--

Hi [~mehakmeetSingh], in case you missed the Hadoop 3.3.6 vote thread on the 
Hadoop dev mailing list, here's the excerpt:

[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 9.007 s <<< FAILURE! - in org.apache.hadoop.hbase.regionserver.TestSyncTimeRangeTracker
[ERROR] org.apache.hadoop.hbase.regionserver.TestSyncTimeRangeTracker.testConcurrentIncludeTimestampCorrectness  Time elapsed: 3.13 s  <<< ERROR!
java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.hbase.regionserver.TestSyncTimeRangeTracker$RandomTestData.<init>(TestSyncTimeRangeTracker.java:91)
    at org.apache.hadoop.hbase.regionserver.TestSyncTimeRangeTracker.testConcurrentIncludeTimestampCorrectness(TestSyncTimeRangeTracker.java:156)

bq. hbase-filesystem has three test failures in TestHBOSSContractDistCp, and is not reproducible with Hadoop 3.3.5.
bq. [ERROR] Failures:
bq. [ERROR] TestHBOSSContractDistCp>AbstractContractDistCpTest.testDistCpUpdateCheckFileSkip:976->Assert.fail:88 10 errors in file of length 10
bq. [ERROR] TestHBOSSContractDistCp>AbstractContractDistCpTest.testUpdateDeepDirectoryStructureNoChange:270->AbstractContractDistCpTest.assertCounterInRange:290->Assert.assertTrue:41->Assert.fail:88 Files Skipped value 0 too below minimum 1
bq. [ERROR] TestHBOSSContractDistCp>AbstractContractDistCpTest.testUpdateDeepDirectoryStructureToRemote:259->AbstractContractDistCpTest.distCpUpdateDeepDirectoryStructure:334->AbstractContractDistCpTest.assertCounterInRange:294->Assert.assertTrue:41->Assert.fail:88 Files Copied value 2 above maximum 1
bq. [INFO]
bq. [ERROR] Tests run: 240, Failures: 3, Errors: 0, Skipped: 58




[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-06-21 Thread Mehakmeet Singh (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735579#comment-17735579
 ] 

Mehakmeet Singh commented on HADOOP-18596:
--

[~weichiu] Sorry, it seems this comment got lost in my email. Can you please 
point to the failed hbase-filesystem test? Is it the same as 
https://issues.apache.org/jira/browse/HADOOP-18633, and is it still failing 
even after the fix?

In terms of behavior, I believe we want this turned on by default, since it is 
required to handle incorrect file skips for distcp updates when checksums are 
not compatible.




[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-06-17 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17733813#comment-17733813
 ] 

Wei-Chiu Chuang commented on HADOOP-18596:
--

This change and HADOOP-18633 broke an hbase-filesystem test. Should this 
behavior be turned on by default, or should we fix the test in hbase-filesystem?




[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-02-15 Thread Mehakmeet Singh (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688975#comment-17688975
 ] 

Mehakmeet Singh commented on HADOOP-18596:
--

[~ayushtkn] Thanks for pointing it out. I'll open a new Jira to address this; 
I think I see where the issue is.




[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-02-15 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688959#comment-17688959
 ] 

Ayush Saxena commented on HADOOP-18596:
---

The newly added test seems flaky: it always fails for me locally, and it failed 
in the daily build as well. Can you check:
https://ci-hadoop.apache.org/view/Hadoop/job/hadoop-qbt-trunk-java8-linux-x86_64/1133/testReport/org.apache.hadoop.tools.contract/TestLocalContractDistCp/testDistCpUpdateCheckFileSkip/




[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-02-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688400#comment-17688400
 ] 

ASF GitHub Bot commented on HADOOP-18596:
-

mehakmeet merged PR #5387:
URL: https://github.com/apache/hadoop/pull/5387







[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-02-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687862#comment-17687862
 ] 

ASF GitHub Bot commented on HADOOP-18596:
-

hadoop-yetus commented on PR #5387:
URL: https://github.com/apache/hadoop/pull/5387#issuecomment-1427676401

   :confetti_ball: **+1 overall**

   | Vote | Subsystem | Runtime | Logfile | Comment |
   |:----:|----------:|--------:|:-------:|:-------:|
   | +0 :ok: | reexec | 10m 29s | | Docker mode activated. |
   |||| _ Prechecks _ |
   | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
   | +0 :ok: | codespell | 0m 0s | | codespell was not available. |
   | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
   | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
   | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. |
   |||| _ branch-3.3 Compile Tests _ |
   | +1 :green_heart: | mvninstall | 43m 4s | | branch-3.3 passed |
   | +1 :green_heart: | compile | 0m 26s | | branch-3.3 passed |
   | +1 :green_heart: | checkstyle | 0m 27s | | branch-3.3 passed |
   | +1 :green_heart: | mvnsite | 0m 32s | | branch-3.3 passed |
   | +1 :green_heart: | javadoc | 0m 30s | | branch-3.3 passed |
   | +1 :green_heart: | spotbugs | 0m 55s | | branch-3.3 passed |
   | +1 :green_heart: | shadedclient | 28m 23s | | branch has no errors when building and testing our client artifacts. |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: | mvninstall | 0m 34s | | the patch passed |
   | +1 :green_heart: | compile | 0m 21s | | the patch passed |
   | +1 :green_heart: | javac | 0m 21s | | the patch passed |
   | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
   | +1 :green_heart: | checkstyle | 0m 14s | | the patch passed |
   | +1 :green_heart: | mvnsite | 0m 23s | | the patch passed |
   | +1 :green_heart: | javadoc | 0m 16s | | the patch passed |
   | +1 :green_heart: | spotbugs | 0m 48s | | the patch passed |
   | +1 :green_heart: | shadedclient | 27m 58s | | patch has no errors when building and testing our client artifacts. |
   |||| _ Other Tests _ |
   | +1 :green_heart: | unit | 16m 3s | | hadoop-distcp in the patch passed. |
   | +1 :green_heart: | asflicense | 0m 33s | | The patch does not generate ASF License warnings. |
   | | | 132m 58s | | |

   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5387/1/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5387 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux a974ffad64bf 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | branch-3.3 / 399de84d4e44775c948729f62206321ef1081338 |
   | Default Java | Private Build-1.8.0_352-8u352-ga-1~18.04-b08 |
   | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5387/1/testReport/ |
   | Max. process+thread count | 618 (vs. ulimit of 5500) |
   | modules | C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5387/1/console |
   | versions | git=2.17.1 maven=3.6.0 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

   This message was automatically generated.
   
   





[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-02-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687783#comment-17687783
 ] 

ASF GitHub Bot commented on HADOOP-18596:
-

mehakmeet opened a new pull request, #5387:
URL: https://github.com/apache/hadoop/pull/5387

   ### Description of PR
   Adding toggleable support for modification time during distcp -update between two stores with incompatible checksum comparison.

   ### How was this patch tested?
   Compiled and ran the added tests on ABFS and S3A.

   ### For code changes:

   - [ ] Does the title or this PR starts with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files?
   
   







[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-02-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17686715#comment-17686715
 ] 

ASF GitHub Bot commented on HADOOP-18596:
-

steveloughran commented on PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#issuecomment-1424675867

   ok, you can backport to 3.3, but not to the 3.3.5 branch







[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-02-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17686626#comment-17686626
 ] 

ASF GitHub Bot commented on HADOOP-18596:
-

mehakmeet merged PR #5308:
URL: https://github.com/apache/hadoop/pull/5308







[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-02-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17686437#comment-17686437
 ] 

ASF GitHub Bot commented on HADOOP-18596:
-

hadoop-yetus commented on PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#issuecomment-1424149799

   :confetti_ball: **+1 overall**

   | Vote | Subsystem | Runtime | Logfile | Comment |
   |:----:|----------:|--------:|:-------:|:-------:|
   | +0 :ok: | reexec | 0m 38s | | Docker mode activated. |
   |||| _ Prechecks _ |
   | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
   | +0 :ok: | codespell | 0m 0s | | codespell was not available. |
   | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
   | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
   | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: | mvninstall | 43m 24s | | trunk passed |
   | +1 :green_heart: | compile | 0m 35s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
   | +1 :green_heart: | compile | 0m 31s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
   | +1 :green_heart: | checkstyle | 0m 34s | | trunk passed |
   | +1 :green_heart: | mvnsite | 0m 37s | | trunk passed |
   | +1 :green_heart: | javadoc | 0m 36s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
   | +1 :green_heart: | javadoc | 0m 30s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
   | +1 :green_heart: | spotbugs | 0m 59s | | trunk passed |
   | +1 :green_heart: | shadedclient | 23m 17s | | branch has no errors when building and testing our client artifacts. |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: | mvninstall | 0m 32s | | the patch passed |
   | +1 :green_heart: | compile | 0m 26s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
   | +1 :green_heart: | javac | 0m 26s | | the patch passed |
   | +1 :green_heart: | compile | 0m 23s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
   | +1 :green_heart: | javac | 0m 23s | | the patch passed |
   | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
   | +1 :green_heart: | checkstyle | 0m 17s | | the patch passed |
   | +1 :green_heart: | mvnsite | 0m 26s | | the patch passed |
   | +1 :green_heart: | javadoc | 0m 21s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
   | +1 :green_heart: | javadoc | 0m 19s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
   | +1 :green_heart: | spotbugs | 0m 51s | | the patch passed |
   | +1 :green_heart: | shadedclient | 23m 15s | | patch has no errors when building and testing our client artifacts. |
   |||| _ Other Tests _ |
   | +1 :green_heart: | unit | 14m 57s | | hadoop-distcp in the patch passed. |
   | +1 :green_heart: | asflicense | 0m 38s | | The patch does not generate ASF License warnings. |
   | | | 115m 32s | | |

   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/9/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5308 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 5f6ce632d6e0 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 58d8f84aa532f953da99ab8fe5bed9c28ea442f9 |
   | Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
   | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/9/testReport/ |
   | Max. process+thread count | 568 (vs. ulimit of 5500) |
   | modules | C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/9/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

   This message was automatically generated.
   
   





[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-02-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17686365#comment-17686365
 ] 

ASF GitHub Bot commented on HADOOP-18596:
-

mehakmeet commented on code in PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#discussion_r1101287643


##
hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/mapred/TestCopyCommitter.java:
##
@@ -562,11 +562,12 @@ private void testCommitWithChecksumMismatch(boolean skipCrc)
 Path sourcePath = new Path(sourceBase + srcFilename);
 CopyListingFileStatus sourceCurrStatus =
 new CopyListingFileStatus(fs.getFileStatus(sourcePath));
-Assert.assertFalse(!DistCpUtils.checksumsAreEqual(
-fs, new Path(sourceBase + srcFilename), null,
-fs, new Path(targetBase + srcFilename),
-sourceCurrStatus.getLen())
-.equals(CopyMapper.ChecksumComparison.FALSE));
+Assert.assertEquals("Checksum should not be equal",
+DistCpUtils.checksumsAreEqual(
+fs, new Path(sourceBase + srcFilename), null,
+fs, new Path(targetBase + srcFilename),
+sourceCurrStatus.getLen()),
+CopyMapper.ChecksumComparison.FALSE);

Review Comment:
   ooh, this is an old test, I changed the assertFalse to assertEquals but 
didn't realize the mistake I made. Thanks.
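
The reviewer's remark is not included in this message, so the sketch below does not try to identify the mistake being acknowledged. It only checks, under stated assumptions, that the removed double-negated `assertFalse(!x.equals(FALSE))` and the added `assertEquals(message, x, FALSE)` pass and fail on the same inputs. The class and method names are hypothetical, the enum is a local stand-in for `CopyMapper.ChecksumComparison` (its `TRUE` and `INCOMPATIBLE` values are assumed), and no JUnit dependency is used; the two helpers model when each assertion would pass.

```java
import java.util.Objects;

// Hypothetical sketch comparing the two assertion forms from the diff above.
class AssertionRewriteSketch {

    // Local stand-in for CopyMapper.ChecksumComparison (values assumed).
    enum ChecksumComparison { TRUE, FALSE, INCOMPATIBLE }

    // Models assertFalse(!result.equals(FALSE)):
    // assertFalse passes exactly when its argument is false.
    static boolean oldFormPasses(ChecksumComparison result) {
        boolean argument = !result.equals(ChecksumComparison.FALSE);
        return !argument;
    }

    // Models assertEquals(message, result, FALSE):
    // assertEquals passes exactly when the two values are equal.
    static boolean newFormPasses(ChecksumComparison result) {
        return Objects.equals(result, ChecksumComparison.FALSE);
    }

    public static void main(String[] args) {
        for (ChecksumComparison c : ChecksumComparison.values()) {
            System.out.println(c + ": old form passes=" + oldFormPasses(c)
                    + ", new form passes=" + newFormPasses(c));
        }
    }
}
```

As a general JUnit 4 note, `Assert.assertEquals(String message, Object expected, Object actual)` takes the expected value before the actual one. Swapping them does not change whether the assertion passes, only how a failure message reads.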








[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-02-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17686362#comment-17686362
 ] 

ASF GitHub Bot commented on HADOOP-18596:
-

steveloughran commented on code in PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#discussion_r1101279463


##
hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/mapred/TestCopyCommitter.java:
##
@@ -562,11 +562,12 @@ private void testCommitWithChecksumMismatch(boolean skipCrc)
 Path sourcePath = new Path(sourceBase + srcFilename);
 CopyListingFileStatus sourceCurrStatus =
 new CopyListingFileStatus(fs.getFileStatus(sourcePath));
-Assert.assertFalse(!DistCpUtils.checksumsAreEqual(
-fs, new Path(sourceBase + srcFilename), null,
-fs, new Path(targetBase + srcFilename),
-sourceCurrStatus.getLen())
-.equals(CopyMapper.ChecksumComparison.FALSE));
+Assert.assertEquals("Checksum should not be equal",
+DistCpUtils.checksumsAreEqual(
+fs, new Path(sourceBase + srcFilename), null,
+fs, new Path(targetBase + srcFilename),
+sourceCurrStatus.getLen()),
+CopyMapper.ChecksumComparison.FALSE);

Review Comment:
   Good test, but you need to put the expected value first, so that 
assertEquals prints the right "expected 1 actual 2" message. Bit of a PITA.
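
The point about argument order can be illustrated with a small sketch. The helper `failureMessage` below is hypothetical: it is a stand-in that approximates the failure text of JUnit 4's `Assert.assertEquals(String message, Object expected, Object actual)` so the example runs without JUnit on the classpath. The enum is a local copy of the three values named in the diff, not the Hadoop class.

```java
import java.util.Objects;

/** Hypothetical stand-in for JUnit 4's assertEquals, for illustration only. */
final class AssertOrderSketch {

  /** Local copy of the tri-state result named in the diff. */
  enum ChecksumComparison { TRUE, FALSE, INCOMPATIBLE }

  /** Returns the failure text, or null when expected and actual match. */
  static String failureMessage(String message, Object expected, Object actual) {
    if (Objects.equals(expected, actual)) {
      return null;
    }
    return message + " expected:<" + expected + "> but was:<" + actual + ">";
  }

  public static void main(String[] args) {
    // Pretend the call under test wrongly reported matching checksums.
    ChecksumComparison actual = ChecksumComparison.TRUE;

    // Expected value first: the text names FALSE as the expectation.
    System.out.println(failureMessage("Checksum should not be equal",
        ChecksumComparison.FALSE, actual));

    // Arguments swapped: the text claims TRUE was expected, which misleads.
    System.out.println(failureMessage("Checksum should not be equal",
        actual, ChecksumComparison.FALSE));
  }
}
```

The assertion passes or fails identically in both orders; only the wording of the failure report changes.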








[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-02-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17686267#comment-17686267
 ] 

ASF GitHub Bot commented on HADOOP-18596:
-

hadoop-yetus commented on PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#issuecomment-1423801752

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 39s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 2 new or modified test files.  |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  43m 23s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   0m 37s |  |  trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  compile  |   0m 31s |  |  trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   0m 33s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 36s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 38s |  |  trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javadoc  |   0m 29s |  |  trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   0m 59s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  23m  4s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 32s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 25s |  |  the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javac  |   0m 25s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 23s |  |  the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   0m 23s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 16s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 26s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 20s |  |  the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javadoc  |   0m 20s |  |  the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   0m 50s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  23m  1s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  unit  |  15m 17s |  |  hadoop-distcp in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 37s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 115m 30s |  |  |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/8/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5308 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 51f2095ec10e 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 0f63b45b01ad69d7ccc810a52b22dbcfbab4c0cc |
   | Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
   |  Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/8/testReport/ |
   | Max. process+thread count | 768 (vs. ulimit of 5500) |
   | modules | C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/8/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   





[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-02-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17686220#comment-17686220
 ] 

ASF GitHub Bot commented on HADOOP-18596:
-

mehakmeet commented on PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#issuecomment-1423704609

   Have made the changes @steveloughran suggested, including changing ">" to 
">=". 
   
   I feel we could use either strictly greater or greater-or-equal for the 
check. With the latter we take a slight risk: the source file may have changed 
at the same time the last sync took place, and we would skip the copy in that 
case. With the former we may do an additional copy even though no content has 
changed, because the mod time is the same for both source and target. 
Shouldn't we prioritize accuracy here?
   Any more thoughts on whether we should change this or keep ">="?
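
The trade-off between the two comparisons only shows up when both timestamps are identical. A minimal sketch, assuming modification times are plain epoch milliseconds and using hypothetical method names (`canSkipStrict`, `canSkipInclusive`) that do not appear in the PR:

```java
/** Illustrative only; these helpers are not part of CopyMapper. */
final class ModTimeBoundarySketch {

  /** Skip only when the target is strictly newer than the source. */
  static boolean canSkipStrict(long sourceModTime, long targetModTime) {
    return sourceModTime < targetModTime;
  }

  /** Skip when the target is newer than, or exactly as new as, the source. */
  static boolean canSkipInclusive(long sourceModTime, long targetModTime) {
    return sourceModTime <= targetModTime;
  }

  public static void main(String[] args) {
    long t = 1_675_900_000_000L; // hypothetical timestamp in milliseconds

    // Equal timestamps: the strict check copies again, the inclusive one skips.
    System.out.println("strict, equal:      " + canSkipStrict(t, t));
    System.out.println("inclusive, equal:   " + canSkipInclusive(t, t));

    // Source modified after the last sync: both checks force a copy.
    System.out.println("strict, newer src:  " + canSkipStrict(t + 1, t));
    System.out.println("inclusive, newer:   " + canSkipInclusive(t + 1, t));
  }
}
```

With the strict check the cost of a tie is one redundant copy; with the inclusive check the cost of a tie is a possibly missed update.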







[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-02-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685982#comment-17685982
 ] 

ASF GitHub Bot commented on HADOOP-18596:
-

mehakmeet commented on code in PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#discussion_r1100421615


##
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyMapper.java:
##
@@ -361,31 +373,60 @@ private boolean canSkip(FileSystem sourceFS, CopyListingFileStatus source,
 if (sameLength && source.getLen() == 0) {
   return true;
 }
-// if both the source and target have the same length, then check if the
-// config to use modification time is set to true, then use the
-// modification time and checksum comparison to determine if the copy can
-// be skipped else if not set then just use the checksum comparison to
-// check copy skip.
+// If the src and target file have same size and block size, we would
+// check if the checkCrc flag is enabled or not. If enabled, and the
+// modTime comparison is enabled then return true if target file is older
+// than the source file, since this indicates that the target file is
+// recently updated and the source is not changed more recently than the
+// update, we can skip the copy else we would copy.
+// If skipCrc flag is disabled, we would check the checksum comparison
+// which is an enum representing 3 values, of which if the comparison
+// returns NOT_COMPATIBLE, we'll try to check modtime again, else return
+// the result of checksum comparison which are compatible(true or false).
 //
 // Note: Different object stores can have different checksum algorithms
 // resulting in no checksum comparison that results in return true
 // always, having the modification time enabled can help in these
 // scenarios to not incorrectly skip a copy. Refer: HADOOP-18596.
+
 if (sameLength && sameBlockSize) {
-  if (useModTimeToUpdate) {
-return
-(source.getModificationTime() < target.getModificationTime()) &&
-(skipCrc || DistCpUtils.checksumsAreEqual(sourceFS,
-source.getPath(), null,
-targetFS, target.getPath(), source.getLen()));
+  if (skipCrc) {
+return maybeUseModTimeToCompare(source, target);
   } else {
-return skipCrc || DistCpUtils
+ChecksumComparison checksumComparison = DistCpUtils
 .checksumsAreEqual(sourceFS, source.getPath(), null,
 targetFS, target.getPath(), source.getLen());
+LOG.debug("Result of checksum comparison between src {} and target "
++ "{} : {}", source, target, checksumComparison);
+if (checksumComparison.equals(ChecksumComparison.INCOMPATIBLE)) {
+  return maybeUseModTimeToCompare(source, target);
+}
+// if skipCrc is disabled and checksumComparison is compatible we
+// need not check the mod time.
+return checksumComparison.equals(ChecksumComparison.TRUE);
   }
-} else {
-  return false;
 }
+return false;
+  }
+
+  /**
+   * If the mod time comparison is enabled, check the mod time else return
+   * false.
+   * Comparison: If the target file perceives to have greater mod time(older)
+   * than the source file, we can assume that there has been no new changes
+   * that occurred in the source file, hence we should return true to skip the
+   * copy of the file.
+   * @param source Source fileStatus.
+   * @param target Target fileStatus.
+   * @return boolean representing result of modTime check.
+   */
+  private boolean maybeUseModTimeToCompare(
+  CopyListingFileStatus source, FileStatus target) {
+if (useModTimeToUpdate) {
+  return source.getModificationTime() < target.getModificationTime();

Review Comment:
   hmm, good point.
   
   just thinking if there would ever be a scenario when the source file is 
updated at the same time as it is synced to a different store, so we can have 
"=" to skip the copy...
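
For context, the decision flow in the hunk above can be sketched as a pure function. This is a simplified illustration, not the Hadoop source: the checksum result is passed in as a parameter rather than computed, the zero-length shortcut is omitted, and the disabled-mod-time branch returns false because that is what the Javadoc quoted in the diff states.

```java
/** Illustrative sketch of the skip decision; names mirror the diff only loosely. */
final class CanSkipFlowSketch {

  /** Local copy of the tri-state result named in the diff. */
  enum ChecksumComparison { TRUE, FALSE, INCOMPATIBLE }

  /** Mod-time fallback, using the strict comparison shown in the hunk. */
  static boolean maybeUseModTimeToCompare(
      boolean useModTimeToUpdate, long sourceModTime, long targetModTime) {
    if (useModTimeToUpdate) {
      return sourceModTime < targetModTime;
    }
    // Assumption taken from the quoted Javadoc: false when the check is off.
    return false;
  }

  /** Returns true when the copy can be skipped. */
  static boolean canSkip(boolean sameLength, boolean sameBlockSize,
      boolean skipCrc, boolean useModTimeToUpdate,
      ChecksumComparison checksum, long sourceModTime, long targetModTime) {
    if (!(sameLength && sameBlockSize)) {
      return false;
    }
    // No usable checksum (skipped or incompatible): fall back to mod time.
    if (skipCrc || checksum == ChecksumComparison.INCOMPATIBLE) {
      return maybeUseModTimeToCompare(
          useModTimeToUpdate, sourceModTime, targetModTime);
    }
    // Comparable checksums decide on their own; mod time is not consulted.
    return checksum == ChecksumComparison.TRUE;
  }

  public static void main(String[] args) {
    System.out.println(canSkip(true, true, false, true,
        ChecksumComparison.INCOMPATIBLE, 100L, 200L));
    System.out.println(canSkip(true, true, false, true,
        ChecksumComparison.INCOMPATIBLE, 200L, 100L));
  }
}
```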






[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-02-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685980#comment-17685980
 ] 

ASF GitHub Bot commented on HADOOP-18596:
-

mehakmeet commented on code in PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#discussion_r1100412015


##
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/util/DistCpUtils.java:
##
@@ -613,8 +623,12 @@ public static void compareFileLengthsAndChecksums(long srcLen,
 
 //At this point, src & dest lengths are same. if length==0, we skip checksum
 if ((srcLen != 0) && (!skipCrc)) {
-  if (!checksumsAreEqual(sourceFS, source, sourceChecksum,
-  targetFS, target, srcLen)) {
+  CopyMapper.ChecksumComparison
+  checksumComparison = checksumsAreEqual(sourceFS, source, sourceChecksum,
+  targetFS, target, srcLen);
+  // If Checksum comparison is false set it to false, else set to true.
+  boolean checksumResult = !checksumComparison.equals(CopyMapper.ChecksumComparison.FALSE);

Review Comment:
   We'll be setting "checksumResult" to true for both the "INCOMPATIBLE" and 
"TRUE" results from the checksumsAreEqual() method; otherwise it is false and 
we go through L632. So we follow the same flow as before, since an incompatible 
result from this method was treated as true earlier too.
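
The mapping described here, from the three-valued result back to the old boolean, can be sketched as follows. The enum is a local copy of the values named in the diff and `checksumResult` is written as a standalone helper for illustration; in the PR it is a local variable.

```java
/** Illustrative only; mirrors the boolean derived in the hunk above. */
final class ChecksumResultSketch {

  /** Local copy of the tri-state result named in the diff. */
  enum ChecksumComparison { TRUE, FALSE, INCOMPATIBLE }

  /** Only a definite mismatch yields false; INCOMPATIBLE counts as true. */
  static boolean checksumResult(ChecksumComparison comparison) {
    return comparison != ChecksumComparison.FALSE;
  }

  public static void main(String[] args) {
    for (ChecksumComparison c : ChecksumComparison.values()) {
      System.out.println(c + " -> " + checksumResult(c));
    }
  }
}
```

Treating INCOMPATIBLE as true preserves the earlier behaviour for stores whose checksums cannot be compared.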








[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-02-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685976#comment-17685976
 ] 

ASF GitHub Bot commented on HADOOP-18596:
-

steveloughran commented on code in PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#discussion_r1100395250


##
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/util/DistCpUtils.java:
##
@@ -613,8 +623,12 @@ public static void compareFileLengthsAndChecksums(long srcLen,
 
 //At this point, src & dest lengths are same. if length==0, we skip checksum
 if ((srcLen != 0) && (!skipCrc)) {
-  if (!checksumsAreEqual(sourceFS, source, sourceChecksum,
-  targetFS, target, srcLen)) {
+  CopyMapper.ChecksumComparison
+  checksumComparison = checksumsAreEqual(sourceFS, source, sourceChecksum,
+  targetFS, target, srcLen);
+  // If Checksum comparison is false set it to false, else set to true.
+  boolean checksumResult = !checksumComparison.equals(CopyMapper.ChecksumComparison.FALSE);

Review Comment:
   Is this outcome right? L632 should be reached for any outcome other than 
True.



##
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyMapper.java:
##
@@ -361,31 +373,60 @@ private boolean canSkip(FileSystem sourceFS, CopyListingFileStatus source,
 if (sameLength && source.getLen() == 0) {
   return true;
 }
-// if both the source and target have the same length, then check if the
-// config to use modification time is set to true, then use the
-// modification time and checksum comparison to determine if the copy can
-// be skipped else if not set then just use the checksum comparison to
-// check copy skip.
+// If the src and target file have same size and block size, we would
+// check if the checkCrc flag is enabled or not. If enabled, and the
+// modTime comparison is enabled then return true if target file is older
+// than the source file, since this indicates that the target file is
+// recently updated and the source is not changed more recently than the
+// update, we can skip the copy else we would copy.
+// If skipCrc flag is disabled, we would check the checksum comparison
+// which is an enum representing 3 values, of which if the comparison
+// returns NOT_COMPATIBLE, we'll try to check modtime again, else return
+// the result of checksum comparison which are compatible(true or false).
 //
 // Note: Different object stores can have different checksum algorithms
 // resulting in no checksum comparison that results in return true
 // always, having the modification time enabled can help in these
 // scenarios to not incorrectly skip a copy. Refer: HADOOP-18596.
+
 if (sameLength && sameBlockSize) {
-  if (useModTimeToUpdate) {
-return
-(source.getModificationTime() < target.getModificationTime()) &&
-(skipCrc || DistCpUtils.checksumsAreEqual(sourceFS,
-source.getPath(), null,
-targetFS, target.getPath(), source.getLen()));
+  if (skipCrc) {
+return maybeUseModTimeToCompare(source, target);
   } else {
-return skipCrc || DistCpUtils
+ChecksumComparison checksumComparison = DistCpUtils
 .checksumsAreEqual(sourceFS, source.getPath(), null,
 targetFS, target.getPath(), source.getLen());
+LOG.debug("Result of checksum comparison between src {} and target "
++ "{} : {}", source, target, checksumComparison);
+if (checksumComparison.equals(ChecksumComparison.INCOMPATIBLE)) {
+  return maybeUseModTimeToCompare(source, target);
+}
+// if skipCrc is disabled and checksumComparison is compatible we
+// need not check the mod time.
+return checksumComparison.equals(ChecksumComparison.TRUE);
   }
-} else {
-  return false;
 }
+return false;
+  }
+
+  /**
+   * If the mod time comparison is enabled, check the mod time else return
+   * false.
+   * Comparison: If the target file perceives to have greater mod time(older)
+   * than the source file, we can assume that there has been no new changes
+   * that occurred in the source file, hence we should return true to skip the
+   * copy of the file.
+   * @param source Source fileStatus.
+   * @param target Target fileStatus.
+   * @return boolean representing result of modTime check.
+   */
+  private boolean maybeUseModTimeToCompare(
+  CopyListingFileStatus source, FileStatus target) {
+if (useModTimeToUpdate) {
+  return source.getModificationTime() < target.getModificationTime();

Review Comment:
   should this be <= ?



##
hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/mapred/TestCopyCommitter.java:
##
@@ -562,9 

[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-02-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685929#comment-17685929
 ] 

ASF GitHub Bot commented on HADOOP-18596:
-

hadoop-yetus commented on PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#issuecomment-1422712877

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 57s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 2 new or modified test files.  |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  45m 47s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   0m 30s |  |  trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  compile  |   0m 26s |  |  trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   0m 27s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 32s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 32s |  |  trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javadoc  |   0m 24s |  |  trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   0m 53s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  26m  9s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 30s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 24s |  |  the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javac  |   0m 24s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 20s |  |  the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   0m 20s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 15s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 24s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 18s |  |  the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javadoc  |   0m 18s |  |  the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   0m 48s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  25m 50s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  unit  |  45m 52s |  |  hadoop-distcp in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 33s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 153m 35s |  |  |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/6/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5308 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 84304b0f676c 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 4ff7f36138039a8ed90a0fd20af0d7b32f5a752e |
   | Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
   |  Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/6/testReport/ |
   | Max. process+thread count | 607 (vs. ulimit of 5500) |
   | modules | C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/6/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   





[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-02-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685926#comment-17685926
 ] 

ASF GitHub Bot commented on HADOOP-18596:
-

hadoop-yetus commented on PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#issuecomment-1422701704

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 57s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 2 new or modified test files.  |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  43m 17s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   0m 34s |  |  trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  compile  |   0m 32s |  |  trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   0m 33s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 37s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 36s |  |  trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javadoc  |   0m 30s |  |  trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   1m  0s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  23m 33s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 33s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 26s |  |  the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javac  |   0m 26s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 24s |  |  the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   0m 24s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 16s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 27s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 20s |  |  the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javadoc  |   0m 19s |  |  the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   0m 50s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  22m 58s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  unit  |  38m 41s |  |  hadoop-distcp in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 38s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 139m 35s |  |  |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/7/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5308 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 32cba67aa5ec 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / d64e6b66d4851b711c40dfaf0b9ffc2a8e5a24ac |
   | Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
   |  Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/7/testReport/ |
   | Max. process+thread count | 630 (vs. ulimit of 5500) |
   | modules | C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/7/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   





[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-02-06 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684868#comment-17684868
 ] 

ASF GitHub Bot commented on HADOOP-18596:
-

steveloughran commented on code in PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#discussion_r109011


##
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpConstants.java:
##
@@ -142,6 +142,19 @@ private DistCpConstants() {
   "distcp.blocks.per.chunk";
 
   public static final String CONF_LABEL_USE_ITERATOR = "distcp.use.iterator";
+
+  /**
+   * Enabling distcp -update to use modification time of source and target

Review Comment:
   nit: use `{@code distcp -update}` for better formatting



##
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyMapper.java:
##
@@ -114,6 +115,8 @@ public void setup(Context context) throws IOException, 
InterruptedException {
 PRESERVE_STATUS.getConfigLabel()));
 directWrite = conf.getBoolean(
 DistCpOptionSwitch.DIRECT_WRITE.getConfigLabel(), false);
+useModTimeToUpdate =
+conf.getBoolean(DistCpConstants.CONF_LABEL_UPDATE_MOD_TIME, true);

Review Comment:
   Use the proposed default-value constant here instead of the literal `true`.



##
hadoop-tools/hadoop-distcp/src/site/markdown/DistCp.md.vm:
##
@@ -631,14 +631,39 @@ hadoop distcp -update -numListstatusThreads 20  \
 Because object stores are slow to list files, consider setting the 
`-numListstatusThreads` option when performing a `-update` operation
 on a large directory tree (the limit is 40 threads).
 
-When `DistCp -update` is used with object stores,
-generally only the modification time and length of the individual files are 
compared,
-not any checksums. The fact that most object stores do have valid timestamps
-for directories is irrelevant; only the file timestamps are compared.
-However, it is important to have the clock of the client computers close
-to that of the infrastructure, so that timestamps are consistent between
-the client/HDFS cluster and that of the object store. Otherwise, changed files 
may be
-missed/copied too often.
+When `DistCp -update` is used with object stores, generally only the
+modification time and length of the individual files are compared, not any
+checksums if the checksum algorithm between the two stores is different.
+
+* The `distcp -update` between two object stores with different checksum
+  algorithm compares the modification times of source and target files along
+  with the file size to determine whether to skip the file copy. The behavior
+  is controlled by the property `distcp.update.modification.time`, which is
+  set to true by default. If the source file is more recently modified than
+  the target file, it is assumed that the content has changed, and the file
+  should be updated.
+  We need to ensure that there is no clock skew between the machines.
+  The fact that most object stores do have valid timestamps for directories
+  is irrelevant; only the file timestamps are compared. However, it is
+  important to have the clock of the client computers close to that of the
+  infrastructure, so that timestamps are consistent between the client/HDFS
+  cluster and that of the object store. Otherwise, changed files may be
+  missed/copied too often.
+
+* `distcp.update.modification.time` can be used alongside the checksum check
+  in stores with same checksum algorithm as well. if set to true we check
+  both modification time and checksum between the files, but if this property

Review Comment:
   OK, and the default option is "don't use checksums". I was wondering whether
   we would want this on automatically if you are on `-skipCrc` or the formats
   are incompatible.
   
   But if we leave it as something to explicitly ask for, your code looks right.



##
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyMapper.java:
##
@@ -85,6 +85,7 @@ static enum FileAction {
   private boolean append = false;
   private boolean verboseLog = false;
   private boolean directWrite = false;
+  private boolean useModTimeToUpdate = true;

Review Comment:
   Add a constant for the default value.
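
A minimal sketch of what the two "default value" comments above ask for, assuming a plain `Map` as a stand-in for Hadoop's `Configuration` so that the example runs on its own. The property name comes from the documentation hunk in this thread; the constant name `CONF_LABEL_UPDATE_MOD_TIME_DEFAULT` and the class `ModTimeDefaultSketch` are hypothetical and are not taken from the patch.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch only: a plain Map stands in for Hadoop's Configuration. */
class ModTimeDefaultSketch {
  /** Property name as given in the documentation hunk of the patch. */
  static final String CONF_LABEL_UPDATE_MOD_TIME =
      "distcp.update.modification.time";

  /** Hypothetical name for the default-value constant the review asks for. */
  static final boolean CONF_LABEL_UPDATE_MOD_TIME_DEFAULT = true;

  /** Reads the flag, falling back to the single shared default constant. */
  static boolean useModTimeToUpdate(Map<String, String> conf) {
    String value = conf.get(CONF_LABEL_UPDATE_MOD_TIME);
    return value == null
        ? CONF_LABEL_UPDATE_MOD_TIME_DEFAULT
        : Boolean.parseBoolean(value);
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    System.out.println(useModTimeToUpdate(conf));
    conf.put(CONF_LABEL_UPDATE_MOD_TIME, "false");
    System.out.println(useModTimeToUpdate(conf));
  }
}
```

With one constant, the field initializer and the configuration lookup cannot drift apart if the default is later changed.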






[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-02-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1768#comment-1768
 ] 

ASF GitHub Bot commented on HADOOP-18596:
-

mehakmeet commented on code in PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#discussion_r1096995207


##
hadoop-tools/hadoop-distcp/src/site/markdown/DistCp.md.vm:
##
@@ -631,14 +631,39 @@ hadoop distcp -update -numListstatusThreads 20  \
 Because object stores are slow to list files, consider setting the 
`-numListstatusThreads` option when performing a `-update` operation
 on a large directory tree (the limit is 40 threads).
 
-When `DistCp -update` is used with object stores,
-generally only the modification time and length of the individual files are 
compared,
-not any checksums. The fact that most object stores do have valid timestamps
-for directories is irrelevant; only the file timestamps are compared.
-However, it is important to have the clock of the client computers close
-to that of the infrastructure, so that timestamps are consistent between
-the client/HDFS cluster and that of the object store. Otherwise, changed files 
may be
-missed/copied too often.
+When `DistCp -update` is used with object stores, generally only the
+modification time and length of the individual files are compared, not any
+checksums if the checksum algorithm between the two stores is different.
+
+* The `distcp -update` between two object stores with different checksum
+  algorithm compares the modification times of source and target files along
+  with the file size to determine whether to skip the file copy. The behavior
+  is controlled by the property `distcp.update.modification.time`, which is
+  set to true by default. If the source file is more recently modified than
+  the target file, it is assumed that the content has changed, and the file
+  should be updated.
+  We need to ensure that there is no clock skew between the machines.
+  The fact that most object stores do have valid timestamps for directories
+  is irrelevant; only the file timestamps are compared. However, it is
+  important to have the clock of the client computers close to that of the
+  infrastructure, so that timestamps are consistent between the client/HDFS
+  cluster and that of the object store. Otherwise, changed files may be
+  missed/copied too often.
+
+* `distcp.update.modification.time` can be used alongside the checksum check
+  in stores with same checksum algorithm as well. if set to true we check
+  both modification time and checksum between the files, but if this property

Review Comment:
   The timestamps are only used alongside checksums if the config is set to
   true; otherwise we follow the default behavior offered today (so we can
   switch it off in cases where we know checksums would work).
   
   Since S3A/ABFS have checksums disabled, we get null back for the checksum
   value, so the checksum check always returns true in that case. It can also
   return true when the checksums really are identical, so if we rely on the
   checksum check being true and then don't compare the timestamps, we can get
   false skips.
   
   So, should we check the timestamps inside the checksum check instead? For
   example, if the checksums for both source and target are not null and the
   property is set to true, then do the mod time check? This would need a few
   more changes, as we would have to change the parameters in different classes
   to pass the config value as well.
   
   We could also make the default value false and enable the property only in
   the cases where we want it, which keeps today's behavior as the default.
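
The false-skip scenario described in this comment can be sketched as follows. This is an illustration under stated assumptions, not the DistCp code: the class `NullChecksumSkipSketch`, its methods, and the use of strings for checksums are all hypothetical. The only behavior taken from the thread is that a missing (null) checksum makes the checksum check report "equal", and that a source newer than the target is treated as changed.

```java
/** Sketch only: hypothetical names, not the DistCp classes. */
class NullChecksumSkipSketch {
  /**
   * Mimics a checksum check that reports "equal" when either side has no
   * checksum, as described for stores that have checksums disabled.
   */
  static boolean checksumsEqualOrMissing(String src, String dst) {
    if (src == null || dst == null) {
      return true;
    }
    return src.equals(dst);
  }

  /** Skip only if every enabled check agrees that the target is current. */
  static boolean canSkip(long srcLen, long dstLen,
      String srcChecksum, String dstChecksum,
      long srcModTime, long dstModTime, boolean useModTime) {
    if (srcLen != dstLen) {
      return false;
    }
    if (!checksumsEqualOrMissing(srcChecksum, dstChecksum)) {
      return false;
    }
    // With the property on, a source newer than the target forces a copy.
    return !useModTime || srcModTime <= dstModTime;
  }
}
```

With `useModTime` off, two same-length files with null checksums are skipped even when the source is newer, which is the false skip being discussed; with it on, the same pair is copied.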
   





> Distcp -update between different cloud stores to use modification time while 
> checking for file skip.
> 
>
> Key: HADOOP-18596
> URL: https://issues.apache.org/jira/browse/HADOOP-18596
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Reporter: Mehakmeet Singh
>Assignee: Mehakmeet Singh
>Priority: Major
>  Labels: pull-request-available
>
> Distcp -update currently relies on File size, block size, and Checksum 
> comparisons to figure out which files should be skipped or copied. 
> Since different cloud stores have different checksum algorithms we should 
> check for modification time as well to the checks.
> This would ensure that while performing -update if the files are perceived to 
> be out of sync we should copy them. The machines between which the file 
> transfers occur should be in time sync to avoid any extra copies.
> Improving testing and documentation for modification time checks between 
> different object stores to ensure no incorrect skipping of files.




[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-02-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683527#comment-17683527
 ] 

ASF GitHub Bot commented on HADOOP-18596:
-

steveloughran commented on code in PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#discussion_r1094941352


##
hadoop-tools/hadoop-distcp/src/site/markdown/DistCp.md.vm:
##
@@ -631,14 +631,39 @@ hadoop distcp -update -numListstatusThreads 20  \
 Because object stores are slow to list files, consider setting the 
`-numListstatusThreads` option when performing a `-update` operation
 on a large directory tree (the limit is 40 threads).
 
-When `DistCp -update` is used with object stores,
-generally only the modification time and length of the individual files are 
compared,
-not any checksums. The fact that most object stores do have valid timestamps
-for directories is irrelevant; only the file timestamps are compared.
-However, it is important to have the clock of the client computers close
-to that of the infrastructure, so that timestamps are consistent between
-the client/HDFS cluster and that of the object store. Otherwise, changed files 
may be
-missed/copied too often.
+When `DistCp -update` is used with object stores, generally only the
+modification time and length of the individual files are compared, not any
+checksums if the checksum algorithm between the two stores is different.
+
+* The `distcp -update` between two object stores with different checksum
+  algorithm compares the modification times of source and target files along
+  with the file size to determine whether to skip the file copy. The behavior
+  is controlled by the property `distcp.update.modification.time`, which is
+  set to true by default. If the source file is more recently modified than
+  the target file, it is assumed that the content has changed, and the file
+  should be updated.
+  We need to ensure that there is no clock skew between the machines.
+  The fact that most object stores do have valid timestamps for directories
+  is irrelevant; only the file timestamps are compared. However, it is
+  important to have the clock of the client computers close to that of the
+  infrastructure, so that timestamps are consistent between the client/HDFS
+  cluster and that of the object store. Otherwise, changed files may be
+  missed/copied too often.
+
+* `distcp.update.modification.time` can be used alongside the checksum check
+  in stores with same checksum algorithm as well. if set to true we check
+  both modification time and checksum between the files, but if this property

Review Comment:
   Really? I think that if the checksums match then timestamps shouldn't be
   compared at all. If two files' checksums match, that is sufficient to say
   "they are the same".
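
One way to express this position is a three-way comparison in which timestamps are consulted only when the checksums cannot be compared. The sketch below uses hypothetical names throughout (`ChecksumFirstSkipSketch`, `Comparison`, string checksums and algorithm labels) and should not be read as the code in the PR.

```java
import java.util.Objects;

/** Sketch only: hypothetical names, not the DistCp classes. */
class ChecksumFirstSkipSketch {
  enum Comparison { MATCH, MISMATCH, UNKNOWN }

  /** UNKNOWN when a checksum is missing or the algorithms differ. */
  static Comparison compare(String srcAlgo, String srcSum,
      String dstAlgo, String dstSum) {
    if (srcSum == null || dstSum == null
        || !Objects.equals(srcAlgo, dstAlgo)) {
      return Comparison.UNKNOWN;
    }
    return srcSum.equals(dstSum) ? Comparison.MATCH : Comparison.MISMATCH;
  }

  /** Checksums decide when they can; mod time is only the fallback. */
  static boolean canSkip(boolean sameLength, Comparison checksums,
      long srcModTime, long dstModTime) {
    if (!sameLength) {
      return false;
    }
    switch (checksums) {
      case MATCH:
        return true;   // timestamps are not consulted at all
      case MISMATCH:
        return false;
      default:
        // UNKNOWN: a source newer than the target is assumed to have changed.
        return srcModTime <= dstModTime;
    }
  }
}
```

Under this arrangement a matching checksum always wins, so clock skew between stores can only affect pairs whose checksums are unavailable or incompatible.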








[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-02-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683457#comment-17683457
 ] 

ASF GitHub Bot commented on HADOOP-18596:
-

hadoop-yetus commented on PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#issuecomment-1413886297

   :confetti_ball: **+1 overall**
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 56s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 1 new or modified test files.  |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  45m 38s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   0m 31s |  |  trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  compile  |   0m 26s |  |  trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   0m 28s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 31s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 31s |  |  trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javadoc  |   0m 24s |  |  trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   0m 55s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  25m 59s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 30s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 24s |  |  the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javac  |   0m 24s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 22s |  |  the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   0m 22s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 14s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 23s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 19s |  |  the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javadoc  |   0m 17s |  |  the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   0m 48s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  26m 29s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  unit  |  46m 10s |  |  hadoop-distcp in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 33s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 154m 19s |  |  |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/5/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5308 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 2a9835850fa1 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 8c427bd8a37f76e61bf6c8d0d0e7cb364d3a9344 |
   | Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
   | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/5/testReport/ |
   | Max. process+thread count | 596 (vs. ulimit of 5500) |
   | modules | C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/5/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   





[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-02-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683260#comment-17683260
 ] 

ASF GitHub Bot commented on HADOOP-18596:
-

mehakmeet commented on code in PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#discussion_r1094170778


##
hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/contract/AbstractContractDistCpTest.java:
##
@@ -50,6 +50,7 @@
 import org.apache.hadoop.tools.mapred.CopyMapper;
 import org.apache.hadoop.tools.util.DistCpTestUtils;
 import org.apache.hadoop.util.functional.RemoteIterators;
+import org.apache.http.annotation.Contract;

Review Comment:
   Unused import. Removing this and another one caught by the checkstyle test.








[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-02-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683091#comment-17683091
 ] 

ASF GitHub Bot commented on HADOOP-18596:
-

steveloughran commented on code in PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#discussion_r1093469149


##
hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/contract/AbstractContractDistCpTest.java:
##
@@ -868,35 +892,41 @@ public void testDistCpUpdateCheckFileSkip() throws 
Exception {
 
 Path source = new Path(remoteDir, "file");
 Path dest = new Path(localDir, "file");
+
+Path source_0byte = new Path(remoteDir, "file_0byte");
+Path dest_0byte = new Path(localDir, "file_0byte");
 dest = localFS.makeQualified(dest);
+dest_0byte = localFS.makeQualified(dest_0byte);
 
 // Creating a source file with certain dataset.
 byte[] sourceBlock = dataset(10, 'a', 'z');
 
 // Write the dataset and as well create the target path.
-try (FSDataOutputStream out = remoteFS.create(source)) {
-  out.write(sourceBlock);
-  localFS.create(dest);
-}
+ContractTestUtils.createFile(localFS, dest, true, sourceBlock);
+ContractTestUtils
+.writeDataset(remoteFS, source, sourceBlock, sourceBlock.length,
+1024, true);
 
-verifyPathExists(remoteFS, "", source);
-verifyPathExists(localFS, "", dest);
-DistCpTestUtils
-.assertRunDistCp(DistCpConstants.SUCCESS, remoteDir.toString(),
-localDir.toString(), "-delete -update" + getDefaultCLIOptions(),
-conf);
+// Create 0 byte source and target files.
+ContractTestUtils.createFile(remoteFS, source_0byte, true, new byte[0]);
+ContractTestUtils.createFile(localFS, dest_0byte, true, new byte[0]);
+
+// Execute the distcp -update job.
+Job job = distCpUpdateWithFs(remoteDir, localDir, remoteFS, localFS);
 
 // First distcp -update would normally copy the source to dest.
 verifyFileContents(localFS, dest, sourceBlock);
+// Verify 1 file was skipped in the distcp -update(They 0 byte files).

Review Comment:
   nit: add space after update and replace They with `the`



##
hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/contract/AbstractContractDistCpTest.java:
##
@@ -50,6 +50,7 @@
 import org.apache.hadoop.tools.mapred.CopyMapper;
 import org.apache.hadoop.tools.util.DistCpTestUtils;
 import org.apache.hadoop.util.functional.RemoteIterators;
+import org.apache.http.annotation.Contract;

Review Comment:
   Where does this come from, and where is it used?



##
hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/contract/AbstractContractDistCpTest.java:
##
@@ -913,32 +943,65 @@ public void testDistCpUpdateCheckFileSkip() throws 
Exception {
 long newTargetModTimeNew = modTimeSourceUpd + MODIFICATION_TIME_OFFSET;
 localFS.setTimes(dest, newTargetModTimeNew, -1);
 
-DistCpTestUtils
-.assertRunDistCp(DistCpConstants.SUCCESS, remoteDir.toString(),
-localDir.toString(), "-delete -update" + getDefaultCLIOptions(),
-conf);
+// Execute the distcp -update job.
+Job updatedSourceJobOldSrc =
+distCpUpdateWithFs(remoteDir, localDir, remoteFS,
+localFS);
 
 // File contents should remain same since the mod time for target is
 // newer than the updatedSource which indicates that the sync happened
 // more recently and there is no update.
 verifyFileContents(localFS, dest, sourceBlock);
+// Skipped both 0 byte file and sourceFile(since mod time of target is
+// older than the source it is perceived that source is of older version
+// and we can skip it's copy).
+verifySkipAndCopyCounter(updatedSourceJobOldSrc, 2, 0);
 
 // Subtract by an offset which would ensure enough gap for the test to
 // not fail due to race conditions.
 long newTargetModTimeOld =
 Math.min(modTimeSourceUpd - MODIFICATION_TIME_OFFSET, 0);
 localFS.setTimes(dest, newTargetModTimeOld, -1);
 
-DistCpTestUtils
-.assertRunDistCp(DistCpConstants.SUCCESS, remoteDir.toString(),
-localDir.toString(), "-delete -update" + getDefaultCLIOptions(),
-conf);
+// Execute the distcp -update job.
+Job updatedSourceJobNewSrc = distCpUpdateWithFs(remoteDir, localDir,
+remoteFS,
+localFS);
 
-Assertions.assertThat(RemoteIterators.toList(localFS.listFiles(dest, 
true)))
-.hasSize(1);
+// Verifying the target directory have both 0 byte file and the content
+// file.
+Assertions
+.assertThat(RemoteIterators.toList(localFS.listFiles(localDir, true)))
+.hasSize(2);
 // Now the copy should take place and the file contents should change
 // since the mod time for target is older than the source file indicating
 // that there was an update to the 

[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-02-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683090#comment-17683090
 ] 

ASF GitHub Bot commented on HADOOP-18596:
-

steveloughran commented on code in PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#discussion_r1093464483


##
hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/contract/AbstractContractDistCpTest.java:
##
@@ -857,4 +862,83 @@ public void testDistCpWithUpdateExistFile() throws 
Exception {
 verifyFileContents(localFS, dest, block);
   }
 
+  @Test
+  public void testDistCpUpdateCheckFileSkip() throws Exception {

Review Comment:
   Oh, I didn't believe that was the case. I see the javadocs for it say we
   just skip the big file tests. I'm happy.








[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-02-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683087#comment-17683087
 ] 

ASF GitHub Bot commented on HADOOP-18596:
-

steveloughran commented on code in PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#discussion_r1093465515


##
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpConstants.java:
##
@@ -142,6 +142,13 @@ private DistCpConstants() {
   "distcp.blocks.per.chunk";
 
   public static final String CONF_LABEL_USE_ITERATOR = "distcp.use.iterator";
+
+  /** Distcp -update to use modification time of source and target file to
+   * check while skipping.
+   */
+  public static final String CONF_LABEL_UPDATE_MOD_TIME =

Review Comment:
   1. javadoc to include {@value}
   2. distcp docs to explain the option
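
The first point can be sketched as below, assuming the constant's value is the property name used in the documentation hunk of this PR (the value is cut off in the diff quoted above). The class name `DistCpConstantsDocSketch` and the Javadoc wording are illustrative only.

```java
/** Sketch only: mirrors the constant under review with reworded Javadoc. */
final class DistCpConstantsDocSketch {
  private DistCpConstantsDocSketch() {
  }

  /**
   * Whether {@code distcp -update} compares the modification times of the
   * source and target files when deciding to skip a copy.
   * Value: {@value}.
   */
  static final String CONF_LABEL_UPDATE_MOD_TIME =
      "distcp.update.modification.time"; // assumed value, from the docs hunk
}
```

`{@value}` makes the generated Javadoc show the literal property string, so readers do not have to open the source to find the key to set.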








[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-02-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683085#comment-17683085
 ] 

ASF GitHub Bot commented on HADOOP-18596:
-

steveloughran commented on code in PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#discussion_r1093464483


##
hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/contract/AbstractContractDistCpTest.java:
##
@@ -857,4 +862,83 @@ public void testDistCpWithUpdateExistFile() throws 
Exception {
 verifyFileContents(localFS, dest, block);
   }
 
+  @Test
+  public void testDistCpUpdateCheckFileSkip() throws Exception {

Review Comment:
   Oh, I didn't believe that was the case. I see the docs say we just skipped
   the big files there. If that test is happy then so am I.








[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-02-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683009#comment-17683009
 ] 

ASF GitHub Bot commented on HADOOP-18596:
-

hadoop-yetus commented on PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#issuecomment-1412003628

   :confetti_ball: **+1 overall**

   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |  17m 41s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 1 new or modified test files.  |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  48m 51s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   0m 30s |  |  trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  compile  |   0m 26s |  |  trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   0m 26s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 31s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 31s |  |  trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javadoc  |   0m 24s |  |  trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   0m 55s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  25m 40s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 30s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 24s |  |  the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javac  |   0m 24s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 20s |  |  the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   0m 20s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | -0 :warning: |  checkstyle  |   0m 14s | [/results-checkstyle-hadoop-tools_hadoop-distcp.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/4/artifact/out/results-checkstyle-hadoop-tools_hadoop-distcp.txt) |  hadoop-tools/hadoop-distcp: The patch generated 4 new + 14 unchanged - 0 fixed = 18 total (was 14)  |
   | +1 :green_heart: |  mvnsite  |   0m 23s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 18s |  |  the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javadoc  |   0m 16s |  |  the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   0m 49s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  26m  6s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  unit  |  46m  9s |  |  hadoop-distcp in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 33s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 173m 21s |  |  |

   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/4/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5308 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux e05c0db28e77 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / f094248e19b615310d4791e31d293b88ed6bde37 |
   | Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
   |  Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/4/testReport/ |
   | Max. process+thread count | 538 (vs. ulimit of 5500) |
   | modules | C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/4/console |
   | versions | 

[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-02-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17682933#comment-17682933
 ] 

ASF GitHub Bot commented on HADOOP-18596:
-

mehakmeet commented on code in PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#discussion_r1092974460


##
hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/contract/AbstractContractDistCpTest.java:
##
@@ -857,4 +862,83 @@ public void testDistCpWithUpdateExistFile() throws 
Exception {
 verifyFileContents(localFS, dest, block);
   }
 
+  @Test
+  public void testDistCpUpdateCheckFileSkip() throws Exception {

Review Comment:
   There is `TestHDFSContractDistCp.java` that does implement this suite. I 
have run it before and it works successfully. Let me know if we still need to 
add another one specifically for hdfs.








[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-02-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17682900#comment-17682900
 ] 

ASF GitHub Bot commented on HADOOP-18596:
-

mehakmeet commented on code in PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#discussion_r1092876062


##
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpConstants.java:
##
@@ -142,6 +142,13 @@ private DistCpConstants() {
   "distcp.blocks.per.chunk";
 
   public static final String CONF_LABEL_USE_ITERATOR = "distcp.use.iterator";
+
+  /** Distcp -update to use modification time of source and target file to
+   * check while skipping.
+   */
+  public static final String CONF_LABEL_UPDATE_MOD_TIME =

Review Comment:
   Not sure whether you mean a change to the ".md" docs or to the Javadoc 
already added. I'll add some more info to the Javadoc for the constant.








[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-01-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17682172#comment-17682172
 ] 

ASF GitHub Bot commented on HADOOP-18596:
-

steveloughran commented on code in PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#discussion_r1090843134


##
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpConstants.java:
##
@@ -142,6 +142,13 @@ private DistCpConstants() {
   "distcp.blocks.per.chunk";
 
   public static final String CONF_LABEL_USE_ITERATOR = "distcp.use.iterator";
+
+  /** Distcp -update to use modification time of source and target file to
+   * check while skipping.
+   */
+  public static final String CONF_LABEL_UPDATE_MOD_TIME =

Review Comment:
   going to need docs.
   



##
hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/contract/AbstractContractDistCpTest.java:
##
@@ -857,4 +862,83 @@ public void testDistCpWithUpdateExistFile() throws 
Exception {
 verifyFileContents(localFS, dest, block);
   }
 
+  @Test
+  public void testDistCpUpdateCheckFileSkip() throws Exception {

Review Comment:
   I'm thinking of a way to test that 0 byte files don't get copied.
   
   The testUpdateDeepDirectoryStructureNoChange() test shows how the counters 
are used for validation. The new test should validate that the files are skipped 
as well as checking the contents.
   
   That should be usable to verify that 0 byte files are always skipped, 
something we can't do with content validation.
   
   One thing to be aware of is that this test suite isn't implemented by hdfs, 
because the way it creates a new fs on every call is too slow. There should be 
some specific hdfs to local test we should cover too.
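
   A minimal sketch of the counter-based validation described above, using a 
simulated update pass rather than a real DistCp job. The `SkipCounterSketch` 
class, its `Counter` enum, and the length-only skip rule are hypothetical 
stand-ins for illustration, not the DistCp API. It shows why a skip counter can 
confirm that a 0 byte file was skipped when comparing contents cannot: the 
target holds the same (empty) contents either way.

```java
import java.util.EnumMap;
import java.util.LinkedHashMap;
import java.util.Map;

final class SkipCounterSketch {
  /** Hypothetical counters; a real job would expose its own. */
  enum Counter { COPY, SKIP }

  /**
   * Simulated -update pass over in-memory "file systems".
   * Assumption: a file is skipped when the target already holds
   * a file of the same length (a deliberately simplified rule).
   */
  static Map<Counter, Integer> update(Map<String, byte[]> source,
      Map<String, byte[]> target) {
    Map<Counter, Integer> counters = new EnumMap<>(Counter.class);
    counters.put(Counter.COPY, 0);
    counters.put(Counter.SKIP, 0);
    for (Map.Entry<String, byte[]> e : source.entrySet()) {
      byte[] existing = target.get(e.getKey());
      boolean skip = existing != null
          && existing.length == e.getValue().length;
      if (!skip) {
        target.put(e.getKey(), e.getValue().clone());
      }
      counters.merge(skip ? Counter.SKIP : Counter.COPY, 1, Integer::sum);
    }
    return counters;
  }

  public static void main(String[] args) {
    Map<String, byte[]> source = new LinkedHashMap<>();
    source.put("empty", new byte[0]);
    source.put("data", new byte[] {1, 2, 3});

    Map<String, byte[]> target = new LinkedHashMap<>();
    target.put("empty", new byte[0]);

    // The empty file is already present, so only "data" should be copied.
    Map<Counter, Integer> counters = update(source, target);
    System.out.println("copied=" + counters.get(Counter.COPY)
        + " skipped=" + counters.get(Counter.SKIP));
  }
}
```

   In a real contract test the same idea would be expressed by asserting on the 
job's own counters after the update run, alongside the existing content checks.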
   
   



##
hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/contract/AbstractContractDistCpTest.java:
##
@@ -857,4 +862,83 @@ public void testDistCpWithUpdateExistFile() throws 
Exception {
 verifyFileContents(localFS, dest, block);
   }
 
+  @Test
+  public void testDistCpUpdateCheckFileSkip() throws Exception {
+describe("Distcp update to check file skips.");
+
+Path source = new Path(remoteDir, "file");
+Path dest = new Path(localDir, "file");
+dest = localFS.makeQualified(dest);
+
+// Creating a source file with certain dataset.
+byte[] sourceBlock = dataset(10, 'a', 'z');
+
+// Write the dataset and as well create the target path.
+try (FSDataOutputStream out = remoteFS.create(source)) {

Review Comment:
   If you use `ContractTestUtils.writeDataset()` here, the write is followed by 
a check that the file is of the correct length; L882 just looks for existence.








[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-01-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17682021#comment-17682021
 ] 

ASF GitHub Bot commented on HADOOP-18596:
-

hadoop-yetus commented on PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#issuecomment-1408457069

   :confetti_ball: **+1 overall**

   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 57s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 1 new or modified test files.  |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  46m 19s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   0m 31s |  |  trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  compile  |   0m 27s |  |  trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   0m 27s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 31s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 31s |  |  trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javadoc  |   0m 25s |  |  trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   0m 56s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  26m  4s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 30s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 24s |  |  the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javac  |   0m 24s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 21s |  |  the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   0m 21s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 14s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 23s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 18s |  |  the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javadoc  |   0m 17s |  |  the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   0m 50s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  26m  4s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  unit  |  46m  3s |  |  hadoop-distcp in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 32s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 154m 24s |  |  |

   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/3/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5308 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 33f023938bd4 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / d23f13be729f0fe0770ce856b4f676f6782d83b9 |
   | Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
   |  Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/3/testReport/ |
   | Max. process+thread count | 540 (vs. ulimit of 5500) |
   | modules | C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/3/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   





[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-01-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17679756#comment-17679756
 ] 

ASF GitHub Bot commented on HADOOP-18596:
-

hadoop-yetus commented on PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#issuecomment-1400145282

   :confetti_ball: **+1 overall**

   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 59s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  |
   | +1 :green_heart: |  @author  |   0m  1s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 1 new or modified test files.  |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  46m 28s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   0m 31s |  |  trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  compile  |   0m 27s |  |  trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   0m 28s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 32s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 32s |  |  trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javadoc  |   0m 24s |  |  trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   0m 57s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  26m 31s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 31s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 24s |  |  the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javac  |   0m 24s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 22s |  |  the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   0m 22s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 15s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 23s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 19s |  |  the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javadoc  |   0m 17s |  |  the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   0m 48s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  25m 58s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  unit  |  45m 26s |  |  hadoop-distcp in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 33s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 154m 21s |  |  |

   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/2/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5308 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 96f6b364be34 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / ee9a8568ae9e97cb94a05bbe1b2191811e0d45ee |
   | Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
   |  Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/2/testReport/ |
   | Max. process+thread count | 616 (vs. ulimit of 5500) |
   | modules | C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/2/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   





[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-01-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678783#comment-17678783
 ] 

ASF GitHub Bot commented on HADOOP-18596:
-

mehakmeet commented on code in PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#discussion_r1081471243


##
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyMapper.java:
##
@@ -354,7 +354,14 @@ private boolean canSkip(FileSystem sourceFS, 
CopyListingFileStatus source,
 boolean sameLength = target.getLen() == source.getLen();
 boolean sameBlockSize = source.getBlockSize() == target.getBlockSize()
 || !preserve.contains(FileAttribute.BLOCKSIZE);
-if (sameLength && sameBlockSize) {
+// checksum check to be done if same file len(greater than 0), same block
+// size and the target file has been updated more recently than the source
+// file.
+// Note: For Different cloud stores with different checksum algorithms,
+// checksum comparisons are not performed so we would be depending on the
+// file size and modification time.
+if (sameLength && (source.getLen() > 0) && sameBlockSize &&
+source.getModificationTime() < target.getModificationTime()) {

Review Comment:
   Ah, I actually had to add a check before this one that skips the file every 
time if its size is 0; I forgot to add it in this version locally. Good catch.
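
   To make the intended order of checks concrete, here is a minimal sketch, 
assuming a simplified `FileMeta` holder in place of the real file status classes 
and a `checksumsMatch` flag in place of the real checksum comparison (both are 
hypothetical names). It illustrates the logic discussed in this comment and is 
not the patch itself.

```java
final class SkipCheckSketch {
  /** Hypothetical stand-in for the file status fields the check reads. */
  static final class FileMeta {
    final long len;
    final long blockSize;
    final long modTime;

    FileMeta(long len, long blockSize, long modTime) {
      this.len = len;
      this.blockSize = blockSize;
      this.modTime = modTime;
    }
  }

  /**
   * Returns true when the copy can be skipped.
   * Assumption: checksumsMatch carries the outcome of the checksum
   * comparison; for stores whose checksums cannot be compared, the
   * caller would pass true so that length and modification time decide.
   */
  static boolean canSkip(FileMeta source, FileMeta target,
      boolean preserveBlockSize, boolean checksumsMatch) {
    boolean sameLength = target.len == source.len;
    // Empty files on both sides: nothing to compare, always skip.
    if (sameLength && source.len == 0) {
      return true;
    }
    boolean sameBlockSize = source.blockSize == target.blockSize
        || !preserveBlockSize;
    // Only consider skipping when the target was written after the source.
    if (sameLength && sameBlockSize && source.modTime < target.modTime) {
      return checksumsMatch;
    }
    return false;
  }

  public static void main(String[] args) {
    FileMeta src = new FileMeta(10, 128, 100);
    // Target newer than source: skip. Prints true.
    System.out.println(canSkip(src, new FileMeta(10, 128, 200), true, true));
    // Target older than source: copy. Prints false.
    System.out.println(canSkip(src, new FileMeta(10, 128, 50), true, true));
    // Both empty: skip regardless of modification time. Prints true.
    System.out.println(
        canSkip(new FileMeta(0, 128, 100), new FileMeta(0, 128, 50), true, true));
  }
}
```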








[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-01-19 Thread Daniel Carl Jones (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678780#comment-17678780
 ] 

Daniel Carl Jones commented on HADOOP-18596:


{quote}What Mehakmeet proposes is possible, doesn't add any risk of reduced 
copy (only increased copies) and fairly easy to test.
{quote}
So long as we meet this, i.e. the change can only cause more files to be 
included in the update, then it seems fine. Some users may find more 
files being copied than usual, but they are already exposed to the risk of 
newer same-length files not being copied when they should have been. Will 
communicating this bug fix in the change notes be enough?
{quote}We should look out that there shouldn't be a massive difference between 
the clocks so that the updation of the source files from one version to another 
should be more recent than the previous version being synced to cloud storage 
for example.
{quote}
Related to this: is there any way we can have DistCp abort the copy if it 
detects that the source and destination clocks have drifted beyond some 
acceptable threshold? Perhaps a separate Jira if it is a feasible check to add.
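
A minimal sketch of what such a guard might look like, assuming the two clock 
readings have already been obtained somehow (how to read a store's clock is 
left open here). The `ClockDriftGuard` class, its method names, and the 
threshold are hypothetical; nothing like this is confirmed to exist in DistCp.

```java
import java.time.Duration;

final class ClockDriftGuard {
  /** Absolute difference between two clock readings in epoch millis. */
  static Duration drift(long sourceNowMillis, long targetNowMillis) {
    return Duration.ofMillis(Math.abs(sourceNowMillis - targetNowMillis));
  }

  /**
   * Aborts (by throwing) when the drift exceeds the allowed threshold.
   * A drift exactly equal to the threshold is accepted.
   */
  static void checkDrift(long sourceNowMillis, long targetNowMillis,
      Duration maxDrift) {
    Duration observed = drift(sourceNowMillis, targetNowMillis);
    if (observed.compareTo(maxDrift) > 0) {
      throw new IllegalStateException("Clock drift of " + observed.toMillis()
          + " ms exceeds the allowed " + maxDrift.toMillis() + " ms");
    }
  }

  public static void main(String[] args) {
    Duration maxDrift = Duration.ofMinutes(1);
    // 30 s apart: within the threshold, no exception.
    checkDrift(1_000_000L, 1_030_000L, maxDrift);
    try {
      // 2 min apart: beyond the threshold, the copy would be aborted.
      checkDrift(1_000_000L, 1_120_000L, maxDrift);
    } catch (IllegalStateException e) {
      System.out.println("aborted: " + e.getMessage());
    }
  }
}
```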




[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-01-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678768#comment-17678768
 ] 

ASF GitHub Bot commented on HADOOP-18596:
-

dannycjones commented on code in PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#discussion_r1081452359


##
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyMapper.java:
##
@@ -354,7 +354,14 @@ private boolean canSkip(FileSystem sourceFS, 
CopyListingFileStatus source,
 boolean sameLength = target.getLen() == source.getLen();
 boolean sameBlockSize = source.getBlockSize() == target.getBlockSize()
 || !preserve.contains(FileAttribute.BLOCKSIZE);
-if (sameLength && sameBlockSize) {
+// checksum check to be done if same file len(greater than 0), same block
+// size and the target file has been updated more recently than the source
+// file.
+// Note: For Different cloud stores with different checksum algorithms,
+// checksum comparisons are not performed so we would be depending on the
+// file size and modification time.
+if (sameLength && (source.getLen() > 0) && sameBlockSize &&
+source.getModificationTime() < target.getModificationTime()) {

Review Comment:
   Why the addition of `getLen() > 0`? Do we want to always copy if it's an 
empty file?








[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-01-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678223#comment-17678223
 ] 

ASF GitHub Bot commented on HADOOP-18596:
-

hadoop-yetus commented on PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#issuecomment-1387042961

   :confetti_ball: **+1 overall**

   | Vote | Subsystem | Runtime | Logfile | Comment |
   |:----:|----------:|--------:|:-------:|:-------:|
   | +0 :ok: | reexec | 0m 54s | | Docker mode activated. |
   |||| _ Prechecks _ |
   | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
   | +0 :ok: | codespell | 0m 0s | | codespell was not available. |
   | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
   | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
   | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: | mvninstall | 46m 21s | | trunk passed |
   | +1 :green_heart: | compile | 0m 30s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
   | +1 :green_heart: | compile | 0m 26s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
   | +1 :green_heart: | checkstyle | 0m 27s | | trunk passed |
   | +1 :green_heart: | mvnsite | 0m 30s | | trunk passed |
   | +1 :green_heart: | javadoc | 0m 31s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
   | +1 :green_heart: | javadoc | 0m 25s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
   | +1 :green_heart: | spotbugs | 0m 57s | | trunk passed |
   | +1 :green_heart: | shadedclient | 25m 59s | | branch has no errors when building and testing our client artifacts. |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: | mvninstall | 0m 31s | | the patch passed |
   | +1 :green_heart: | compile | 0m 25s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
   | +1 :green_heart: | javac | 0m 25s | | the patch passed |
   | +1 :green_heart: | compile | 0m 20s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
   | +1 :green_heart: | javac | 0m 20s | | the patch passed |
   | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
   | -0 :warning: | checkstyle | 0m 15s | [/results-checkstyle-hadoop-tools_hadoop-distcp.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/1/artifact/out/results-checkstyle-hadoop-tools_hadoop-distcp.txt) | hadoop-tools/hadoop-distcp: The patch generated 4 new + 9 unchanged - 0 fixed = 13 total (was 9) |
   | +1 :green_heart: | mvnsite | 0m 24s | | the patch passed |
   | +1 :green_heart: | javadoc | 0m 19s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
   | +1 :green_heart: | javadoc | 0m 17s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
   | +1 :green_heart: | spotbugs | 0m 49s | | the patch passed |
   | +1 :green_heart: | shadedclient | 26m 15s | | patch has no errors when building and testing our client artifacts. |
   |||| _ Other Tests _ |
   | +1 :green_heart: | unit | 44m 54s | | hadoop-distcp in the patch passed. |
   | +1 :green_heart: | asflicense | 0m 32s | | The patch does not generate ASF License warnings. |
   | | | 153m 18s | | |

   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/1/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5308 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux d9de88482ed7 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 5d5228db519f0cc615c4955ba36b9f3ee0572788 |
   | Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
   | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/1/testReport/ |
   | Max. process+thread count | 636 (vs. ulimit of 5500) |
   | modules | C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5308/1/console |
   | versions | 

[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-01-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678206#comment-17678206
 ] 

ASF GitHub Bot commented on HADOOP-18596:
-

mehakmeet commented on PR #5308:
URL: https://github.com/apache/hadoop/pull/5308#issuecomment-1386942938

   CC: @steveloughran @mukund-thakur 







[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-01-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678178#comment-17678178
 ] 

ASF GitHub Bot commented on HADOOP-18596:
-

mehakmeet opened a new pull request, #5308:
URL: https://github.com/apache/hadoop/pull/5308

   ### Description of PR
   Uses modification time as an additional check when deciding whether distcp 
-update should skip a file. 
   In specific cases, such as files with the same name and size but different 
content, -update used to skip files incorrectly, because checksums are not 
compared between object stores that use different checksum algorithms. To 
mitigate this, we now also compare the modification times of the target file 
and the source file.
   
   ### How was this patch tested?
   Manually tested on an environment after reproducing the scenario where we 
might incorrectly skip a file.
   Added a test in `AbstractContractDistCpTest.java` to test by changing the 
target file's modification time to emulate the scenario.
   
   Tested on S3A(ap-south-1), ABFS(us-west-2), and LocalFS, and the test was 
successful. 
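
As an illustration of the emulation described above, the following sketch recreates the scenario on the local file system with plain `java.nio`, not with the Hadoop `FileSystem` API or the actual contract test. The class name, method names, file contents, and timestamps are all hypothetical. It builds two files of the same length with different content, sets the target's modification time to be older than the source's, and evaluates a length-plus-modification-time condition of the kind the PR describes.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

final class ModTimeEmulationSketch {
  /** True if the target has the source's length and is strictly newer. */
  static boolean targetLooksCurrent(Path source, Path target)
      throws IOException {
    return Files.size(source) == Files.size(target)
        && Files.getLastModifiedTime(source).toMillis()
            < Files.getLastModifiedTime(target).toMillis();
  }

  /** Builds same-length, different-content files and ages the target. */
  static boolean staleTargetLooksCurrent() {
    try {
      Path source = Files.createTempFile("src", ".txt");
      Path target = Files.createTempFile("dst", ".txt");
      try {
        Files.write(source, "new!".getBytes(StandardCharsets.UTF_8));
        Files.write(target, "old!".getBytes(StandardCharsets.UTF_8));
        // Arbitrary fixed timestamps: the target is one minute older.
        long sourceTime = 1_700_000_000_000L;
        Files.setLastModifiedTime(source, FileTime.fromMillis(sourceTime));
        Files.setLastModifiedTime(target,
            FileTime.fromMillis(sourceTime - 60_000L));
        return targetLooksCurrent(source, target);
      } finally {
        Files.deleteIfExists(source);
        Files.deleteIfExists(target);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(staleTargetLooksCurrent());
  }
}
```

Because the target is older than the source, the condition is false, so a length-only comparison would have treated the files as identical while the modification-time comparison marks the file for copying.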
   
   ### For code changes:
   
   - [X] Does the title of this PR start with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [X] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [X] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [X] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   







[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-01-17 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677789#comment-17677789
 ] 

Steve Loughran commented on HADOOP-18596:
-

Note that we already state in the docs that modtime is used. It's just that the 
author of that paragraph (me) was misinformed.

Expecting clocks to be in sync is unrealistic. At least the modtime is 
configured to be UTC everywhere, so there's no time zone conversion, 
provided this requirement is met everywhere. I am confident the big cloud 
vendors get it right (our tests would probably have caught this by now), but 
private minio deployments may be misconfigured with both NTP and tz.


Mehakmeet's proposal will not cause any file that is copied today to be 
skipped with the patch. What it will do is cause updates where the file 
length is the same to now be copied if source time > dest time. The worst case 
then is "not all updated files are detected".
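
A small sketch can check that claim against the condition in PR #5308. It assumes the only change is the `if` condition guarding the checksum comparison, and that whatever follows the condition is unchanged; the class and method names are hypothetical. The code enumerates all combinations of the four boolean inputs and counts the cases where the old and new conditions disagree.

```java
final class SkipConditionComparison {
  /** Condition before the patch: same length and same block size. */
  static boolean oldCondition(boolean sameLength, boolean sameBlockSize) {
    return sameLength && sameBlockSize;
  }

  /** Condition in the patch: adds the non-empty and mod-time checks. */
  static boolean newCondition(boolean sameLength, boolean nonEmpty,
      boolean sameBlockSize, boolean targetNewer) {
    return sameLength && nonEmpty && sameBlockSize && targetNewer;
  }

  /**
   * Counts disagreements over all 16 input combinations. With
   * newOnly == true it counts cases where only the new condition holds,
   * otherwise cases where only the old condition holds.
   */
  static int countDisagreements(boolean newOnly) {
    int count = 0;
    for (int bits = 0; bits < 16; bits++) {
      boolean sameLength = (bits & 1) != 0;
      boolean nonEmpty = (bits & 2) != 0;
      boolean sameBlockSize = (bits & 4) != 0;
      boolean targetNewer = (bits & 8) != 0;
      boolean oldResult = oldCondition(sameLength, sameBlockSize);
      boolean newResult =
          newCondition(sameLength, nonEmpty, sameBlockSize, targetNewer);
      if (newOnly ? (newResult && !oldResult) : (oldResult && !newResult)) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    System.out.println("only new holds: " + countDisagreements(true));
    System.out.println("only old holds: " + countDisagreements(false));
  }
}
```

The first count is 0: the new condition never holds where the old one did not, so no file that is copied today becomes skippable. The second count is 3: those are the same-length cases (empty file, or target not newer than source) that now fall through to a copy.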

Note that this will also address cross-EZ copies better, because there the hdfs 
cluster will be in 100% sync. The same goes for copies within the same 
s3/azure/gcs store, whether within the same fs uri or across 
containers/buckets/accounts.

The way to do this *properly* would be to log the checksum/etag of the source 
and update if that is different from the last upload. There'll be no need to 
check the destination at all, assuming the workflow is nothing but a chain of 
distcp++ jobs. Something like that would be a complete rewrite and I have no 
enthusiasm for that. FWIW I have played with using spark for a distcp successor: 
 
https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/com/cloudera/spark/cloud/applications/CloudCp.scala
If I was going to replace distcp I'd do it that way:
* modern execution env for dynamically passing work around
* rate throttling across job (allocate capacity to each worker process, share 
that across all active threads)
* good view of progress
* could provide an API to take any RDD as a source of the list of files to 
upload
* IOStats can be collected and marshalled back from workers to driver
* generate avro summary of the update which can then be converted into human 
reports.

I'm not going to go there. One challenge is actually recovering from failure of 
the job, as a complete restart would copy up all files for which the summary 
.avro file hasn't yet been generated. You'd actually want to commit the summary 
of each task attempt *in task commit* so that a new job would be able to pick 
it up and continue. MapReduce AM restart does this automatically, but Spark 
does not.
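
For illustration only, the source-side checksum/etag log mentioned earlier in this comment could look roughly like the sketch below. Nothing like this exists in distcp; `EtagLogSketch` and its method are hypothetical, and the in-memory map stands in for a log that a real workflow would have to persist between jobs and update only after a successful copy.

```java
import java.util.HashMap;
import java.util.Map;

final class EtagLogSketch {
  /** Path -> etag/checksum of the source at the time of the last upload. */
  private final Map<String, String> lastUploaded = new HashMap<>();

  /**
   * Decides from the source alone whether a copy is needed, then records
   * the etag as uploaded. The destination is never consulted.
   */
  boolean needsCopy(String path, String sourceEtag) {
    String previous = lastUploaded.get(path);
    if (sourceEtag.equals(previous)) {
      return false; // unchanged since the last recorded upload
    }
    // A real job would perform the copy here and record only on success.
    lastUploaded.put(path, sourceEtag);
    return true;
  }

  public static void main(String[] args) {
    EtagLogSketch log = new EtagLogSketch();
    System.out.println(log.needsCopy("/data/part-0000", "etag-a"));
    System.out.println(log.needsCopy("/data/part-0000", "etag-a"));
    System.out.println(log.needsCopy("/data/part-0000", "etag-b"));
  }
}
```

The sketch only works under the assumption stated above: every write to the destination goes through the same chain of jobs, so the log is the single source of truth about what was last uploaded.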

Then there are all the ideas from Apache Gobblin.

A distcp successor would be a massive undertaking and doesn't need to be in the 
hadoop modules. What Mehakmeet proposes is possible, doesn't add any risk of 
reduced copy (only increased copies) and is fairly easy to test.




[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-01-17 Thread Mehakmeet Singh (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677654#comment-17677654
 ] 

Mehakmeet Singh commented on HADOOP-18596:
--

{quote}How would you ensure that they are in sync? Keeping two clocks 
*perfectly* in sync looks kind of tough.
{quote}
Good question. Although I am not sure there's a way to have perfect time 
sync, I think that we can use NTP (it is already widely used) to minimize any 
clock drift. Cloud services like AWS already have their own internal time 
synchronization service, and if we have NTP configured on the source machine, 
this should keep the time in sync between the two machines. Although not 
perfect, it should be enough for distcp -update to not skip files 
incorrectly because of clock differences.

We should make sure there isn't a massive difference between the clocks, so 
that, for example, an update of a source file from one version to the next 
carries a later timestamp than the one recorded when the previous version was 
synced to cloud storage.
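
A worked example with made-up numbers shows why a large clock difference matters. The sketch below is hypothetical and models only the "source modification time < target modification time" comparison from the proposal; times are arbitrary ticks, and `clockSkew` is how far the destination's clock runs ahead of the source's.

```java
final class ClockSkewSketch {
  /** Per the proposed check, the target counts as current only if newer. */
  static boolean targetLooksCurrent(long sourceModTime, long targetModTime) {
    return sourceModTime < targetModTime;
  }

  /** Timestamp the destination records for a write at the given true time. */
  static long recordedTime(long trueTime, long clockSkew) {
    return trueTime + clockSkew;
  }

  public static void main(String[] args) {
    long syncOfV1 = 200; // true time at which version 1 reached the target
    long writeOfV2 = 300; // true time at which the source was updated to v2

    // Clocks in sync: target mtime 200, source mtime 300, so v2 is copied.
    System.out.println(
        targetLooksCurrent(writeOfV2, recordedTime(syncOfV1, 0)));

    // Destination clock 500 ticks ahead: target mtime 700 looks newer than
    // the v2 write at 300, so the target wrongly appears up to date.
    System.out.println(
        targetLooksCurrent(writeOfV2, recordedTime(syncOfV1, 500)));
  }
}
```

With no skew the comparison is false and version 2 is copied. With the destination clock far ahead it is true, and if the lengths match and checksums cannot be compared, the updated file could be skipped. A skew in the other direction only causes extra copies.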
{quote}Do you plan to introduce an additional option for this or make it a 
default?
{quote}
We are planning to have this by default since this adds more resilience to 
cases where checksum algorithms cannot be compared between different object 
stores.

 

CC: [~ste...@apache.org] [~mthakur] 




[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

2023-01-16 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677623#comment-17677623
 ] 

Ayush Saxena commented on HADOOP-18596:
---

Just bumped into it while checking my mail and have a question here:
{quote}The machines between which the file transfers occur should be in time 
sync to avoid any extra copies.
{quote}
How would you ensure that they are in sync? Keeping two clocks *perfectly* in 
sync looks kind of tough.

An extra copy isn't an issue; it only hurts performance. Not copying and 
being left with stale data is the issue. 

Do you plan to introduce an additional option for this or make it a default?
