[ https://issues.apache.org/jira/browse/HADOOP-13912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steve Loughran resolved HADOOP-13912.
-------------------------------------
    Resolution: Duplicate

closing as duplicate of HADOOP-13786; adding subjiras there

> S3a Multipart Committer (avoid rename)
> --------------------------------------
>
>                 Key: HADOOP-13912
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13912
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs/s3
>            Reporter: Thomas Demoor
>            Assignee: Thomas Demoor
>
> Object stores do not have an efficient rename operation, which is used by the
> Hadoop FileOutputCommitter to atomically promote the "winning" attempt out of
> the multiple (speculative) attempts to the final path. These slow job commits
> are one of the main friction points when using object stores in Hadoop. There
> have been several attempts at resolving this: HADOOP-9565, Apache Spark
> DirectOutputCommitters, ... but they have proven not to be robust in the face
> of adversity (network partitions, ...).
> The current ticket proposes to do the atomic commit by using the S3 Multipart
> API, which allows multiple concurrent uploads on the same object name, each in
> its own "temporary space", identified by the UploadId which is returned as a
> response to InitiateMultipartUpload. Every attempt writes directly to the
> final {{outputPath}}. Data is uploaded using Put Part, and as a response an
> ETag for the part is returned and stored. The CompleteMultipartUpload is
> postponed. Instead, we persist the UploadId (using a _temporary subdir or
> elsewhere) and the ETags. When a certain "job" wins,
> {{CompleteMultipartUpload}} is called for each of its files using the proper
> list of Part ETags.
> Completing a MultipartUpload is a metadata-only operation (internally in S3)
> and is thus orders of magnitude faster than the rename-based approach, which
> moves all the data.
> Required work:
> * Expose the multipart initiate and complete calls in S3AOutputStream to
> S3AFileSystem
> * Use these multipart calls in a custom committer as described above. I
> propose to build on the S3ACommitter [~ste...@apache.org] is doing for
> HADOOP-13786
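
As a rough illustration of the commit protocol described in the issue, the Java sketch below simulates it against an in-memory store. Everything here is an assumption made for the example: InMemoryMultipartStore, PendingCommit and writeAttempt are hypothetical names, not the AWS SDK or the S3A classes, and the ETags are simplified placeholders rather than content hashes. The only point it tries to show is that several attempts can upload parts to the same final path, and nothing becomes visible until the winner's upload is completed.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.UUID;

/** Hypothetical in-memory stand-in for the S3 multipart API (not the AWS SDK). */
class InMemoryMultipartStore {
    private final Map<String, String> pendingKeys = new HashMap<>();
    private final Map<String, TreeMap<Integer, byte[]>> pendingParts = new HashMap<>();
    private final Map<String, byte[]> visible = new HashMap<>();

    /** Starts an upload for a key; each call gets its own private "temporary space". */
    String initiateMultipartUpload(String key) {
        String uploadId = UUID.randomUUID().toString();
        pendingKeys.put(uploadId, key);
        pendingParts.put(uploadId, new TreeMap<>());
        return uploadId;
    }

    /** Stores one part and returns its ETag (simplified: derived from id and part number). */
    String uploadPart(String uploadId, int partNumber, byte[] data) {
        pendingParts.get(uploadId).put(partNumber, data.clone());
        return etagFor(uploadId, partNumber);
    }

    /** Makes the object visible; no part data is re-uploaded at this point. */
    void completeMultipartUpload(String uploadId, List<String> partETags) {
        String key = pendingKeys.remove(uploadId);
        TreeMap<Integer, byte[]> parts = pendingParts.remove(uploadId);
        if (key == null || parts == null) {
            throw new IllegalStateException("unknown or finished upload: " + uploadId);
        }
        List<String> expected = new ArrayList<>();
        ByteArrayOutputStream joined = new ByteArrayOutputStream();
        for (Map.Entry<Integer, byte[]> part : parts.entrySet()) {
            expected.add(etagFor(uploadId, part.getKey()));
            joined.write(part.getValue(), 0, part.getValue().length);
        }
        if (!expected.equals(partETags)) {
            throw new IllegalArgumentException("part ETag list does not match uploaded parts");
        }
        visible.put(key, joined.toByteArray());
    }

    /** Discards a losing attempt's parts without touching the visible object. */
    void abortMultipartUpload(String uploadId) {
        pendingKeys.remove(uploadId);
        pendingParts.remove(uploadId);
    }

    byte[] getObject(String key) {
        return visible.get(key);
    }

    private static String etagFor(String uploadId, int partNumber) {
        return "etag-" + uploadId + "-" + partNumber;
    }
}

/** What an attempt persists instead of completing: the UploadId and the part ETags. */
class PendingCommit {
    final String uploadId;
    final List<String> partETags = new ArrayList<>();

    PendingCommit(String uploadId) {
        this.uploadId = uploadId;
    }
}

class MultipartCommitSketch {
    /** A task attempt: uploads its parts straight to the final path, defers completion. */
    static PendingCommit writeAttempt(InMemoryMultipartStore store, String key, String... chunks) {
        PendingCommit pending = new PendingCommit(store.initiateMultipartUpload(key));
        int partNumber = 1;
        for (String chunk : chunks) {
            byte[] data = chunk.getBytes(StandardCharsets.UTF_8);
            pending.partETags.add(store.uploadPart(pending.uploadId, partNumber++, data));
        }
        return pending;
    }

    public static void main(String[] args) {
        InMemoryMultipartStore store = new InMemoryMultipartStore();
        String finalPath = "output/part-00000";
        // Two speculative attempts write to the same final path concurrently.
        PendingCommit attempt1 = writeAttempt(store, finalPath, "rows from ", "attempt 1");
        PendingCommit attempt2 = writeAttempt(store, finalPath, "rows from ", "attempt 2");
        System.out.println("visible before commit: " + (store.getObject(finalPath) != null));
        // Commit: complete the winner, abort the loser.
        store.completeMultipartUpload(attempt1.uploadId, attempt1.partETags);
        store.abortMultipartUpload(attempt2.uploadId);
        System.out.println(new String(store.getObject(finalPath), StandardCharsets.UTF_8));
    }
}
```

In a real committer the PendingCommit data would have to be persisted somewhere durable (the issue suggests a _temporary subdir or elsewhere) so that the job commit can find it, and losing or failed attempts would need their uploads aborted so that orphaned parts do not accumulate.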