On 21 Feb 2017, at 14:15, Steve Loughran <ste...@hortonworks.com> wrote:

What your patch has made me realise is that I could also do a delayed-commit 
copy by reading in a file, doing a multipart put to its final destination and, 
again, postponing the final commit. This is something which tasks could do in 
their commit rather than a normal COPY+DELETE rename, passing the final 
pending commit information to the job committer. This'd make the rename() 
slower, as it will read and write the data again rather than rely on the 
6-10 MB/s of in-S3 copies, but as these happen in task commit rather than in 
job commit, they slow down the overall job less. That could be used for the 
absolute path commit phase.
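
To make the flow concrete, here is a minimal in-memory sketch of the idea, not S3A code. FakeStore, PendingCommit, task_commit and job_commit are all hypothetical names invented for illustration; a real version would use the S3 multipart upload calls instead of a dict.

```python
from dataclasses import dataclass, field


@dataclass
class PendingCommit:
    """What a task hands to the job committer (hypothetical record)."""
    dest_key: str
    upload_id: int


@dataclass
class FakeStore:
    """In-memory stand-in for an object store with multipart uploads."""
    objects: dict = field(default_factory=dict)  # visible objects
    pending: dict = field(default_factory=dict)  # upload_id -> (key, parts)
    next_id: int = 1

    def start_upload(self, key):
        upload_id = self.next_id
        self.next_id += 1
        self.pending[upload_id] = (key, [])
        return upload_id

    def upload_part(self, upload_id, data):
        self.pending[upload_id][1].append(data)

    def complete_upload(self, upload_id):
        # Only at this point does the destination object become visible.
        key, parts = self.pending.pop(upload_id)
        self.objects[key] = b"".join(parts)


def task_commit(store, src_key, dest_key, part_size):
    """Read the task output and re-upload it to its final key, left pending."""
    data = store.objects[src_key]
    upload_id = store.start_upload(dest_key)
    for offset in range(0, len(data), part_size):
        store.upload_part(upload_id, data[offset:offset + part_size])
    return PendingCommit(dest_key, upload_id)


def job_commit(store, commits):
    """Complete every pending upload collected from the tasks."""
    for commit in commits:
        store.complete_upload(commit.upload_id)
```

The slow part (moving the bytes) sits in task_commit, which runs in parallel across tasks; job_commit only issues one small completion call per file.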


Though, as you can specify a copy range in a multipart put, you could do 
parallelized copies of parts of a file within the S3 store itself and leave the 
result pending, reducing the copy time in seconds to ~ filesize / (parts * 6e6), 
the same as you get from a parallel copy in S3 today. That is: the same time as a 
rename, merely not visible until the job finally chooses to materialize the object.
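
As a rough illustration of that arithmetic, the sketch below splits an object into byte ranges suitable for ranged part copies and applies the filesize / (parts * 6e6) estimate. The helper names are hypothetical, and the 6e6 bytes/s per copy stream is simply the figure quoted above, taken as an assumption rather than a measured value.

```python
COPY_BYTES_PER_SEC = 6e6  # assumed per-stream in-S3 copy rate, from the text


def copy_ranges(size, part_size):
    """Inclusive (first, last) byte ranges covering an object of `size` bytes."""
    return [(start, min(start + part_size, size) - 1)
            for start in range(0, size, part_size)]


def range_header(first, last):
    """Format one range the way a ranged part copy expects: bytes=first-last."""
    return f"bytes={first}-{last}"


def estimated_copy_seconds(size, parts):
    """Copy time if all `parts` ranges are copied in parallel."""
    return size / (parts * COPY_BYTES_PER_SEC)
```

For example, copy_ranges(10, 4) gives [(0, 3), (4, 7), (8, 9)], and a 6 GB file copied as 10 parallel parts is estimated at 6e9 / (10 * 6e6) = 100 seconds.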
