[ https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17678780#comment-17678780 ]
Daniel Carl Jones commented on HADOOP-18596:
--------------------------------------------
{quote}What Mehakmeet proposes is possible, doesn't add any risk of reduced
copy (only increased copies) and fairly easy to test.
{quote}
So long as we meet this, i.e. we only potentially cause more files to be
included in the update, this change seems fine. Some users may find more
files being copied than usual, but they are already exposed to the risk of
newer same-length files not being copied when they should have been. Will
communicating this bug fix in the change notes be enough?
{quote}We should look out that there shouldn't be a massive difference between
the clocks so that the updation of the source files from one version to another
should be more recent than the previous version being synced to cloud storage
for example.
{quote}
Related to this: is there any way we can have DistCp abort the copy if it
detects that the source and destination clocks have drifted beyond some
acceptable threshold? Perhaps that belongs in a separate Jira, if it is a
feasible check to add.
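A drift check of the kind suggested above could look roughly like the
following sketch. It is hypothetical: ClockDriftCheck, exceedsThreshold, and
the maxDrift parameter are names invented here, and nothing in this thread
says DistCp has such a check. How the two clock readings would be obtained
(for example, from the modification time of a small probe file written to
each store) is an assumption left outside the code.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper; not part of DistCp. */
final class ClockDriftCheck {
    private ClockDriftCheck() {}

    /** Absolute difference between the two clock readings. */
    static Duration drift(Instant sourceNow, Instant destNow) {
        return Duration.between(sourceNow, destNow).abs();
    }

    /** True when the clocks differ by more than maxDrift, i.e. the copy should abort. */
    static boolean exceedsThreshold(Instant sourceNow, Instant destNow, Duration maxDrift) {
        return drift(sourceNow, destNow).compareTo(maxDrift) > 0;
    }
}
```

With a one-minute threshold, readings 30 seconds apart would pass and
readings two minutes apart would trigger an abort.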
> Distcp -update between different cloud stores to use modification time while
> checking for file skip.
> ----------------------------------------------------------------------------------------------------
>
> Key: HADOOP-18596
> URL: https://issues.apache.org/jira/browse/HADOOP-18596
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Reporter: Mehakmeet Singh
> Assignee: Mehakmeet Singh
> Priority: Major
> Labels: pull-request-available
>
> Distcp -update currently relies on file size, block size, and checksum
> comparisons to figure out which files should be skipped or copied.
> Since different cloud stores use different checksum algorithms, we should
> add modification time to these checks.
> This would ensure that, while performing -update, files perceived to be
> out of sync are copied. The machines between which the file
> transfers occur should be in time sync to avoid any extra copies.
> Improve testing and documentation for modification time checks between
> different object stores to ensure no incorrect skipping of files.
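
The skip decision described in the issue can be illustrated with a minimal
sketch. This is not DistCp's actual code: UpdateSkipSketch, FileInfo, and
canSkip are hypothetical names, a null checksum stands in for "the two stores'
checksums cannot be compared", and the block size comparison is left out for
brevity. The rule assumed here is that a same-length target may be skipped
only if it was written at or after the source's last modification.

```java
/** Hypothetical illustration of an -update skip check; not DistCp code. */
final class UpdateSkipSketch {
    private UpdateSkipSketch() {}

    /** Minimal stand-in for a file's status. checksum is null when not comparable. */
    static final class FileInfo {
        final long length;
        final long modificationTime; // milliseconds since the epoch
        final String checksum;

        FileInfo(long length, long modificationTime, String checksum) {
            this.length = length;
            this.modificationTime = modificationTime;
            this.checksum = checksum;
        }
    }

    /** Returns true when the target looks up to date and the copy can be skipped. */
    static boolean canSkip(FileInfo source, FileInfo target) {
        // Different lengths always mean the file must be copied.
        if (source.length != target.length) {
            return false;
        }
        // Comparable checksums decide the question on their own.
        if (source.checksum != null && target.checksum != null) {
            return source.checksum.equals(target.checksum);
        }
        // Otherwise fall back to modification time: a source modified after
        // the target was written is treated as out of sync.
        return target.modificationTime >= source.modificationTime;
    }
}
```

Under this rule, clock skew between the two stores can only be tolerated in
one direction without risk: a target clock that runs behind produces extra
copies, while one that runs ahead could hide a newer source file.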