[ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17677654#comment-17677654
 ] 

Mehakmeet Singh commented on HADOOP-18596:
------------------------------------------

{quote}How would you ensure that they are in sync? Keeping two clocks 
*perfectly* in sync looks kind of tough.
{quote}
Good question. I am not sure there is a way to keep two clocks perfectly in 
sync, but we can use NTP (it is already widely used) to minimize clock skew. 
Cloud services such as AWS already run their own internal time 
synchronization service, so if NTP is also configured on the source machine, 
the clocks on the two sides should stay closely in sync. Although not 
perfect, that should be enough for distcp -update not to skip files 
incorrectly because of clock skew.

We do need to make sure there is no large difference between the clocks, so 
that, for example, a source file updated to a new version always has a more 
recent modification time than the previous version that was synced to cloud 
storage.
{quote}Do you plan to introduce an additional option for this or make it a 
default
{quote}
We are planning to enable this by default, since it adds resilience in 
cases where checksums cannot be compared because the object stores use 
different checksum algorithms.
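
To make the idea concrete, here is a rough sketch of the skip decision 
described above. It is not the actual DistCp code: the class name, the enum, 
and the exact comparison (skip only when the source modification time is not 
newer than the target's) are assumptions for illustration, and the block size 
check is left out for brevity.

```java
/** Illustrative sketch only; names and rules here are hypothetical. */
final class UpdateSkipSketch {

    /** Hypothetical outcome of trying to compare source and target checksums. */
    enum ChecksumResult { MATCH, MISMATCH, INCOMPATIBLE }

    /**
     * Returns true if the copy can be skipped.
     * Assumption: when checksums are not comparable, fall back to
     * modification times and skip only if the source is not newer.
     */
    static boolean canSkip(long sourceLen, long targetLen,
                           ChecksumResult checksums,
                           long sourceModTimeMs, long targetModTimeMs) {
        if (sourceLen != targetLen) {
            return false; // different sizes: always copy
        }
        switch (checksums) {
            case MATCH:
                return true;  // comparable and equal: skip
            case MISMATCH:
                return false; // comparable and different: copy
            default:
                // Different checksum algorithms: use modification time.
                return sourceModTimeMs <= targetModTimeMs;
        }
    }

    public static void main(String[] args) {
        ChecksumResult none = ChecksumResult.INCOMPATIBLE;
        // Source written at t=1000, synced to the target store at t=2000.
        System.out.println(canSkip(10, 10, none, 1000, 2000));
        // Source rewritten at t=3000 with the same length: must be copied.
        System.out.println(canSkip(10, 10, none, 3000, 2000));
        // Same rewrite, but the target clock ran 5000 ms ahead during the
        // earlier sync (recorded 7000): the update is skipped incorrectly.
        System.out.println(canSkip(10, 10, none, 3000, 7000));
    }
}
```

The last call in main shows why the clocks matter: with a large skew, a 
genuinely newer source file can look older than the copy in the target store.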

 

CC: [[email protected]] [~mthakur] 

> Distcp -update between different cloud stores to use modification time while 
> checking for file skip.
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-18596
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18596
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>            Reporter: Mehakmeet Singh
>            Assignee: Mehakmeet Singh
>            Priority: Major
>
> Distcp -update currently relies on file size, block size, and checksum 
> comparisons to figure out which files should be skipped or copied. 
> Since different cloud stores use different checksum algorithms, we should 
> add a modification time check to these comparisons.
> This would ensure that, while performing -update, files perceived to be 
> out of sync are copied. The machines between which the file transfers 
> occur should be in time sync to avoid any extra copies.
> Improve testing and documentation for modification time checks between 
> different object stores to ensure no incorrect skipping of files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

