[jira] [Commented] (HADOOP-18596) Distcp -update between different cloud stores to use modification time while checking for file skip.

Steve Loughran (Jira) Tue, 17 Jan 2023 06:37:21 -0800


    [ 
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17677789#comment-17677789
 ]


Steve Loughran commented on HADOOP-18596:
-----------------------------------------

Note that we already state in the docs that modtime is used. It's just that 
the author of that paragraph (me) was misinformed.

Expecting clocks to be in sync is unrealistic. At least the modtime is 
configured to be UTC everywhere, so there's no time zone conversion, 
provided this requirement is met everywhere. I am confident the big cloud 
vendors get it right (our tests would probably have caught this by now), but 
private minio deployments may have both NTP and tz misconfigured.


Mehakmeet's proposal will not cause any file which is copied today to be 
skipped with the patch. What it will do is cause updates where the file 
length is the same to now be copied if source time > dest time. The worst case 
then is "not all updated files are detected".
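
A minimal sketch of that decision, as a simplified model rather than the 
actual distcp code: FileInfo and should_copy are hypothetical names, modtimes 
are assumed to be epoch milliseconds in UTC, and the block size and checksum 
checks are left out because checksums are assumed not to be comparable across 
the two stores.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical stand-in for a file status entry."""
    length: int    # size in bytes
    mod_time: int  # modification time, epoch millis, assumed UTC


def should_copy(source: FileInfo, dest: Optional[FileInfo]) -> bool:
    """Return True if -update should copy the source file.

    Simplified model: checksums are assumed to be incomparable, so only
    length and modification time are consulted.
    """
    if dest is None:
        return True   # nothing at the destination yet
    if source.length != dest.length:
        return True   # copied today, still copied with the patch
    # Same length: without a modtime check this file would be skipped.
    # With it, a newer source is treated as an update.
    return source.mod_time > dest.mod_time
```

If the source clock lags the destination clock, an updated file of the same 
length can still be skipped here, which is the "not all updated files are 
detected" case.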

Note that this will also address cross-EZ copies better, because there the hdfs 
cluster will be in 100% sync. Same for copies within the same s3/azure/gcs 
store, whether within the same fs uri or across containers/buckets/accounts.

The way to do this *properly* would be to log the checksum/etag of the source 
and update if that is different from the last upload. There'll be no need to 
check the destination at all, assuming the workflow is nothing but a chain of 
distcp++ jobs. Something like that would be a complete rewrite and I have no 
enthusiasm for that. FWIW I have played with using spark for a distcp successor:
https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/com/cloudera/spark/cloud/applications/CloudCp.scala
If I were going to replace distcp I'd do it that way:
* modern execution env for dynamically passing work around
* rate throttling across the job (allocate capacity to each worker process, 
share that across all active threads)
* good view of progress
* could provide an API to take any RDD as a source of the list of files to 
upload
* IOStats can be collected and marshalled back from workers to driver
* generate an avro summary of the update which can then be converted into 
human-readable reports.
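
The rate throttling item in the list above could look roughly like this. It 
is only a sketch under assumptions: the job-wide capacity is split evenly 
across workers, and each worker process shares one token bucket between its 
threads. per_worker_capacity and SharedThrottle are hypothetical names, not 
part of distcp or CloudCp.

```python
import threading
import time


def per_worker_capacity(job_ops_per_second: float, workers: int) -> float:
    """Split the job-wide rate evenly across worker processes (assumed policy)."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return job_ops_per_second / workers


class SharedThrottle:
    """Token bucket shared by all active threads of one worker process."""

    def __init__(self, ops_per_second, clock=time.monotonic, sleep=time.sleep):
        self._rate = ops_per_second
        self._capacity = ops_per_second  # allow at most one second of burst
        self._tokens = ops_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = clock()
        self._lock = threading.Lock()

    def acquire(self, ops=1.0):
        """Block until `ops` tokens are available, then consume them."""
        if ops > self._capacity:
            raise ValueError("request exceeds bucket capacity")
        while True:
            with self._lock:
                now = self._clock()
                # Refill for the time elapsed since the last call.
                elapsed = now - self._last
                self._tokens = min(self._capacity,
                                   self._tokens + elapsed * self._rate)
                self._last = now
                if self._tokens >= ops:
                    self._tokens -= ops
                    return
                wait = (ops - self._tokens) / self._rate
            # Sleep outside the lock so other threads are not blocked on it.
            self._sleep(wait)
```

Each copy thread would call acquire() before issuing a store request. The 
clock and sleep parameters exist only so the behaviour can be checked without 
real waiting.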

I'm not going to go there. One challenge is actually recovering from failure of 
the job, as a complete restart would copy up all files for which the summary 
.avro file hasn't yet been generated. You'd actually want to commit the summary 
of each task attempt *in task commit* so that a new job would be able to pick 
it up and continue. MapReduce AM restart does this automatically, but spark 
does not.
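
A rough sketch of how per-task summaries and source etag tracking could 
combine on restart. The assumptions are loud ones: summaries are written as 
JSON rather than avro to keep the example self-contained, they live on a 
local filesystem where rename is atomic (which is not true of object stores), 
and all function names are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def commit_task_summary(summary_dir: Path, task_id: str, copied: dict) -> Path:
    """Publish one task attempt's summary as {source path: source etag}.

    Written to a temp file, then renamed, so a reader never sees a
    partially written summary.
    """
    summary_dir.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=summary_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as out:
        json.dump(copied, out)
    final = summary_dir / f"{task_id}.json"
    os.replace(tmp, final)  # atomic on a local POSIX filesystem
    return final


def load_committed(summary_dir: Path) -> dict:
    """Merge every committed task summary left by earlier jobs."""
    committed = {}
    if summary_dir.is_dir():
        for summary in sorted(summary_dir.glob("*.json")):
            committed.update(json.loads(summary.read_text()))
    return committed


def files_to_copy(source_listing: dict, committed: dict) -> list:
    """Pick source paths whose etag differs from the last recorded upload.

    The destination is never consulted, only the log of earlier uploads.
    """
    return sorted(path for path, etag in source_listing.items()
                  if committed.get(path) != etag)
```

A restarted job would call load_committed() first and only schedule what 
files_to_copy() returns, so work finished by committed tasks is not repeated.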

Then there's all the ideas from Apache Gobblin.

A distcp successor would be a massive undertaking and doesn't need to be in the 
hadoop modules. What Mehakmeet proposes is possible, doesn't add any risk of 
reduced copying (only increased copies), and is fairly easy to test.

> Distcp -update between different cloud stores to use modification time while 
> checking for file skip.
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-18596
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18596
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>            Reporter: Mehakmeet Singh
>            Assignee: Mehakmeet Singh
>            Priority: Major
>
> Distcp -update currently relies on File size, block size, and Checksum 
> comparisons to figure out which files should be skipped or copied. 
> Since different cloud stores have different checksum algorithms we should 
> add modification time to the checks as well.
> This would ensure that while performing -update if the files are perceived to 
> be out of sync we should copy them. The machines between which the file 
> transfers occur should be in time sync to avoid any extra copies.
> Improving testing and documentation for modification time checks between 
> different object stores to ensure no incorrect skipping of files.



