On 21 Feb 2017, at 14:15, Steve Loughran <ste...@hortonworks.com> wrote:
What your patch has made me realise is that I could also do a delayed-commit copy by reading in a file, doing a multipart put to its final destination, and again postponing the final commit. This is something which tasks could do in their commit rather than a normal COPY+DELETE rename, passing the final pending commit information to the job committer. This'd make the rename() slower, as it will read and write the data again rather than rely on the 6-10 MB/s of in-S3 copies, but as these happen in task commit rather than in job commit, they slow down the overall job less. That could be used for the absolute path commit phase.

Though, as you can specify a copy-range in a multipart put, you could do parallelized copies of parts of a file within the S3 store itself and leave the result pending, reducing copy time in seconds to ~ filesize / (parts * 6e6), the same as you get from a parallel copy in S3 today. That is: the same time as a rename, merely not visible until the job committer finally chooses to materialize the object.
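
To make the arithmetic concrete, here is a rough Python sketch of how the copy ranges and the time estimate could be worked out. It is only an illustration, not code from the patch: the function names `copy_ranges` and `estimated_copy_seconds` are made up for this example, the 6e6 bytes/s figure is the low end of the in-S3 copy rate quoted above, and the estimate assumes every part copy runs fully in parallel. The 5 MiB minimum part size and 10,000-part ceiling are the S3 multipart limits; the range strings use the `bytes=first-last` form that the copy-range header takes. No S3 calls are made.

```python
import math

MIN_PART_SIZE = 5 * 1024 * 1024  # S3 minimum size for every part except the last
MAX_PARTS = 10_000               # S3 maximum number of parts in one multipart upload


def copy_ranges(file_size, part_size):
    """Return (part_number, range_string) pairs covering the whole file.

    Hypothetical helper: each range string is what a copy-range part
    request would carry; part numbers start at 1, byte offsets are inclusive.
    """
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the S3 minimum part size")
    parts = math.ceil(file_size / part_size)
    if parts > MAX_PARTS:
        raise ValueError("too many parts; increase part_size")
    ranges = []
    for i in range(parts):
        first = i * part_size
        last = min(first + part_size, file_size) - 1
        ranges.append((i + 1, f"bytes={first}-{last}"))
    return ranges


def estimated_copy_seconds(file_size, parts, bytes_per_second=6e6):
    """Estimate from the text: filesize / (parts * 6e6).

    Assumes all part copies proceed in parallel at the same rate.
    """
    return file_size / (parts * bytes_per_second)


if __name__ == "__main__":
    size = 1_000_000_000             # a 1 GB file
    ranges = copy_ranges(size, 128 * 1024 * 1024)
    print(len(ranges), ranges[0], ranges[-1])
    print(estimated_copy_seconds(size, len(ranges)))
```

With 128 MiB parts, a 1 GB file splits into 8 ranges, so the estimate comes out at about 21 seconds, against roughly 167 seconds for a single 6 MB/s copy. The upload would still stay pending until the job committer completes it.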