[
https://issues.apache.org/jira/browse/HADOOP-13912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Steve Loughran resolved HADOOP-13912.
-------------------------------------
Resolution: Duplicate
Closing as duplicate of HADOOP-13786; adding sub-JIRAs there.
> S3a Multipart Committer (avoid rename)
> --------------------------------------
>
> Key: HADOOP-13912
> URL: https://issues.apache.org/jira/browse/HADOOP-13912
> Project: Hadoop Common
> Issue Type: New Feature
> Components: fs/s3
> Reporter: Thomas Demoor
> Assignee: Thomas Demoor
>
> Object stores do not have an efficient rename operation, which the Hadoop
> FileOutputCommitter uses to atomically promote the "winning" attempt out of
> the multiple (speculative) attempts to the final path. These slow job commits
> are one of the main friction points when using object stores in Hadoop. There
> have been several attempts at resolving this (HADOOP-9565, the Apache Spark
> DirectOutputCommitters, ...), but they have proven not to be robust in the
> face of adversity (network partitions, ...).
> The current ticket proposes to do the atomic commit by using the S3 Multipart
> API, which allows multiple concurrent uploads to the same object name, each in
> its own "temporary space", identified by the UploadId which is returned in
> response to InitiateMultipartUpload. Every attempt writes directly to the
> final {{outputPath}}. Data is uploaded using Put Part, and the ETag returned
> for each part is stored. The CompleteMultipartUpload call is postponed.
> Instead, we persist the UploadId (using a _temporary subdir or elsewhere) and
> the ETags. When a certain "job" wins, {{CompleteMultipartUpload}} is called
> for each of its files using the proper list of Part ETags.
> Completing a MultipartUpload is a metadata-only operation (internally in S3)
> and is thus orders of magnitude faster than the rename-based approach, which
> moves all the data.
> Required work:
> * Expose the multipart initiate and complete calls in S3AOutputStream to
> S3AFileSystem
> * Use these multipart calls in a custom committer as described above. I
> propose to build on the S3ACommitter [[email protected]] is doing for
> HADOOP-13786
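
The commit protocol described in the quoted issue can be sketched roughly as
follows. This is an illustration only, not the committer built under
HADOOP-13786: FakeMultipartStore is a hypothetical in-memory stand-in for S3,
its method names merely mirror the multipart operations named above, and the
"ETag" here is just a SHA-256 digest chosen for the example. The pending-commit
dictionary stands in for whatever would be persisted under _temporary.

```python
import hashlib
import itertools


class FakeMultipartStore:
    """Hypothetical in-memory model of a bucket; NOT a real S3 client."""

    def __init__(self):
        self.objects = {}   # key -> bytes; only completed uploads appear here
        self.uploads = {}   # upload_id -> (key, {part_number: bytes})
        self._ids = itertools.count(1)

    def initiate_multipart_upload(self, key):
        # Each call gets its own "temporary space", even for the same key.
        upload_id = f"upload-{next(self._ids)}"
        self.uploads[upload_id] = (key, {})
        return upload_id

    def upload_part(self, upload_id, part_number, data):
        _, parts = self.uploads[upload_id]
        parts[part_number] = data
        # Assumption: any stable digest works as the part's ETag in this model.
        return hashlib.sha256(data).hexdigest()

    def complete_multipart_upload(self, upload_id, part_etags):
        key, parts = self.uploads[upload_id]
        for number, etag in part_etags:
            if hashlib.sha256(parts[number]).hexdigest() != etag:
                raise ValueError(f"ETag mismatch for part {number}")
        # The object becomes visible only now; no data is re-uploaded.
        self.objects[key] = b"".join(parts[n] for n, _ in sorted(part_etags))
        del self.uploads[upload_id]

    def abort_multipart_upload(self, upload_id):
        self.uploads.pop(upload_id, None)


def write_attempt(store, key, chunks):
    """Upload one attempt's data and return the record to persist."""
    upload_id = store.initiate_multipart_upload(key)
    parts = [(n, store.upload_part(upload_id, n, chunk))
             for n, chunk in enumerate(chunks, start=1)]
    return {"key": key, "upload_id": upload_id, "parts": parts}


def commit_attempt(store, pending):
    """Promote the winning attempt by completing its postponed upload."""
    store.complete_multipart_upload(pending["upload_id"], pending["parts"])


def abort_attempt(store, pending):
    """Discard a losing attempt so its parts never become visible."""
    store.abort_multipart_upload(pending["upload_id"])


if __name__ == "__main__":
    store = FakeMultipartStore()
    key = "output/part-00000"
    winner = write_attempt(store, key, [b"hello ", b"world"])
    loser = write_attempt(store, key, [b"hello ", b"world"])  # speculative
    print(key in store.objects)   # nothing is visible before the commit
    commit_attempt(store, winner)
    abort_attempt(store, loser)
    print(store.objects[key])
```

Both attempts target the same final key, yet nothing is visible until one of
them is completed, which is what lets the commit avoid a rename. In a real
deployment the losing attempts would also need to be aborted so that their
uploaded parts do not linger.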