[jira] [Resolved] (HADOOP-13912) S3a Multipart Committer (avoid rename)

Steve Loughran (JIRA) Wed, 11 Jan 2017 06:13:24 -0800

     [ 
https://issues.apache.org/jira/browse/HADOOP-13912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Steve Loughran resolved HADOOP-13912.
-------------------------------------
    Resolution: Duplicate

Closing as a duplicate of HADOOP-13786; adding sub-JIRAs there.

> S3a Multipart Committer (avoid rename)
> --------------------------------------
>
>                 Key: HADOOP-13912
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13912
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs/s3
>            Reporter: Thomas Demoor
>            Assignee: Thomas Demoor
>
> Object stores do not have an efficient rename operation, yet the Hadoop 
> FileOutputCommitter relies on rename to atomically promote the "winning" 
> attempt, out of multiple (speculative) attempts, to the final path. The 
> resulting slow job commits are one of the main friction points when using 
> object stores in Hadoop. There have been several attempts at resolving this 
> (HADOOP-9565, Apache Spark DirectOutputCommitters, ...), but they have 
> proven not to be robust in the face of adversity (network partitions, ...).
> This ticket proposes to do the atomic commit using the S3 Multipart API, 
> which allows multiple concurrent uploads to the same object name, each in 
> its own "temporary space", identified by the UploadId that is returned in 
> the response to InitiateMultipartUpload. Every attempt writes directly to 
> the final {{outputPath}}. Data is uploaded using Put Part; each call returns 
> an ETag for the part, which is stored. The CompleteMultipartUpload call is 
> postponed. Instead, we persist the UploadId (using a _temporary subdir or 
> elsewhere) and the ETags. When a certain "job" wins, 
> {{CompleteMultipartUpload}} is called for each of its files using the proper 
> list of Part ETags. 
> Completing a MultipartUpload is a metadata-only operation (internally in S3) 
> and is thus orders of magnitude faster than the rename-based approach, which 
> moves all the data. 
> Required work: 
> * Expose the multipart initiate and complete calls in S3AOutputStream to 
> S3AFileSystem 
> * Use these multipart calls in a custom committer as described above. I 
> propose to build on the S3ACommitter [[email protected]] is doing for 
> HADOOP-13786
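
The deferred-completion flow described in the quoted proposal can be sketched
with a toy model. This is only an illustration under stated assumptions: it
uses no real S3 client and no Hadoop code. FakeObjectStore, write_attempt and
commit_job are hypothetical names invented for this sketch, the ETag is
modelled as an opaque hash of the part, and the fake store concatenates bytes
on completion, whereas the proposal describes completion in S3 as a
metadata-only step. Aborting the losing uploads is also an assumption of the
sketch; the ticket does not say how losers are cleaned up.

```python
import hashlib
import itertools


class FakeObjectStore:
    """Hypothetical in-memory stand-in for a store with multipart uploads."""

    def __init__(self):
        self.objects = {}    # key -> bytes; only completed uploads appear here
        self._pending = {}   # upload_id -> (key, {part_number: (etag, data)})
        self._ids = itertools.count(1)

    def initiate_multipart_upload(self, key):
        # Several uploads may target the same key; each gets its own id.
        upload_id = "upload-%d" % next(self._ids)
        self._pending[upload_id] = (key, {})
        return upload_id

    def upload_part(self, upload_id, part_number, data):
        # The ETag is treated as an opaque token (assumption of this sketch).
        etag = hashlib.sha256(data).hexdigest()
        self._pending[upload_id][1][part_number] = (etag, data)
        return etag

    def complete_multipart_upload(self, upload_id, part_etags):
        key, parts = self._pending[upload_id]
        chunks = []
        for part_number, etag in part_etags:
            stored_etag, data = parts[part_number]
            if stored_etag != etag:
                raise ValueError("ETag mismatch for part %d" % part_number)
            chunks.append(data)
        del self._pending[upload_id]
        # Only now does the object become visible under its final key.
        self.objects[key] = b"".join(chunks)

    def abort_multipart_upload(self, upload_id):
        self._pending.pop(upload_id, None)


def write_attempt(store, key, chunks):
    """Upload every part to the final key, but postpone completion.

    Returns the information an attempt would need to persist
    (UploadId plus the ordered part ETags) so it can be committed later.
    """
    upload_id = store.initiate_multipart_upload(key)
    part_etags = [
        (number, store.upload_part(upload_id, number, chunk))
        for number, chunk in enumerate(chunks, start=1)
    ]
    return {"key": key, "upload_id": upload_id, "part_etags": part_etags}


def commit_job(store, winner, losers):
    """Complete the winning attempt's upload and abort the others."""
    store.complete_multipart_upload(winner["upload_id"], winner["part_etags"])
    for attempt in losers:
        store.abort_multipart_upload(attempt["upload_id"])
```

In this model, two speculative attempts can call write_attempt for the same
key without either becoming visible in store.objects; the object appears only
after commit_job completes the winner's upload.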



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

