[
https://issues.apache.org/jira/browse/HADOOP-18596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17677789#comment-17677789
]
Steve Loughran commented on HADOOP-18596:
-----------------------------------------
note we already state in the docs that modtime is used. it's just that the
author of that paragraph (me) was misinformed.
expecting clocks to be in sync is unrealistic. at least modtime is
configured to be UTC everywhere, so there's no time zone conversion,
provided this requirement is met everywhere. I am confident the big cloud
vendors get it right (our tests would probably have caught this by now), but
private MinIO deployments may be misconfigured in both NTP and tz.
Mehakmeet's proposal will not cause any copy which happens today to be skipped
with the patch. what it will do is cause updates where the file length is the
same to now be copied if source modtime > dest modtime. The worst case then is
"not all updated files are detected".
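To make the proposed check concrete, here is a minimal sketch of the skip decision (illustrative names only, not the actual DistCp internals): equal lengths alone are no longer enough to skip; a source newer than the destination forces a copy.

```java
// Hypothetical sketch of the proposed -update skip check when checksums
// cannot be compared (e.g. copies between different cloud stores).
// Class and method names are illustrative, not real DistCp code.
public class ModTimeSkipCheck {

    /**
     * Decide whether a file can be skipped during distcp -update.
     * @param srcLen      source file length
     * @param dstLen      destination file length
     * @param srcModTime  source modification time (UTC millis)
     * @param dstModTime  destination modification time (UTC millis)
     * @return true iff the copy may be skipped
     */
    public static boolean canSkip(long srcLen, long dstLen,
                                  long srcModTime, long dstModTime) {
        if (srcLen != dstLen) {
            return false;              // size mismatch: always copy
        }
        // same length: skip only if the destination is at least as new
        // as the source. A newer source means "perceived out of sync".
        return srcModTime <= dstModTime;
    }
}
```

Note this only ever turns skips into copies, never copies into skips, which is why the worst case is extra copies rather than missed updates.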
note that this will also handle cross-EZ copies better, because there the HDFS
cluster clocks will be 100% in sync. The same goes for copies within the same
S3/Azure/GCS store, whether within the same FS URI or across
containers/buckets/accounts.
The way to do this *properly* would be to log the checksum/etag of the source
and update if that is different from the last upload. There'd be no need to
check the destination at all, assuming the workflow is nothing but a chain of
distcp++ jobs. Something like that would be a complete rewrite and I have no
enthusiasm for that. FWIW I have played with using spark for a distcp successor
https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/com/cloudera/spark/cloud/applications/CloudCp.scala
If I were going to replace distcp, I'd do it that way:
* modern execution environment for dynamically passing work around
* rate throttling across the job (allocate capacity to each worker process,
share that across all active threads)
* good view of progress
* could provide an API to take any RDD as a source of the list of files to
upload
* IOStats can be collected and marshalled back from workers to driver
* generate an Avro summary of the update which can then be converted into
human-readable reports.
I'm not going to go there. One challenge is actually recovering from failure of
the job, since a complete restart would copy up all files for which the summary
.avro file hasn't yet been generated. You'd actually want to commit the summary
of each task attempt *in task commit* so that a new job would be able to pick
it up and continue. MapReduce AM restart does this automatically, but Spark's
does not.
Then there are all the ideas from Apache Gobblin.
A distcp successor would be a massive undertaking and doesn't need to be in the
Hadoop modules. What Mehakmeet proposes is possible, doesn't add any risk of
reduced copying (only increased copies), and is fairly easy to test.
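For illustration, the checksum/etag-journal idea described above ("log the etag of the source and update if it differs from the last upload") could be sketched as below; the class and methods are entirely hypothetical, not a real DistCp API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of an etag journal: record the source
// checksum/etag at upload time, then on the next run copy only the
// files whose etag has changed. No destination probe is needed if
// the workflow is purely a chain of such jobs.
public class EtagJournal {

    // last-known etag per source path from the previous run;
    // a real implementation would persist this (e.g. to an avro file)
    private final Map<String, String> lastUploaded = new HashMap<>();

    /** True if the path needs (re-)uploading given its current etag. */
    public boolean needsUpload(String path, String currentEtag) {
        return !currentEtag.equals(lastUploaded.get(path));
    }

    /** Record a successful upload; call after the copy completes. */
    public void recordUpload(String path, String etag) {
        lastUploaded.put(path, etag);
    }
}
```

As noted above, the hard part isn't this lookup but persisting the journal atomically across task and job failures.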
> Distcp -update between different cloud stores to use modification time while
> checking for file skip.
> ----------------------------------------------------------------------------------------------------
>
> Key: HADOOP-18596
> URL: https://issues.apache.org/jira/browse/HADOOP-18596
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Reporter: Mehakmeet Singh
> Assignee: Mehakmeet Singh
> Priority: Major
>
> Distcp -update currently relies on file size, block size, and checksum
> comparisons to figure out which files should be skipped or copied.
> Since different cloud stores use different checksum algorithms, we should
> add a modification time comparison to the checks as well.
> This would ensure that while performing -update, if the files are perceived
> to be out of sync, we copy them. The machines between which the file
> transfers occur should be in time sync to avoid any extra copies.
> Improving testing and documentation for modification time checks between
> different object stores to ensure no incorrect skipping of files.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)