[
https://issues.apache.org/jira/browse/HIVE-16295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503216#comment-16503216
]
Steve Loughran commented on HIVE-16295:
---------------------------------------
* PathOutputCommitterFactory: you can ask for that to become limited private +
unstable with Hive added to the mix; file a MAPREDUCE patch for it
* for the other, again, limited private + unstable for the internal commit
constant, so we know to leave it alone; that one goes under HADOOP
bq. For the _SUCCESS file, is it something that is common to all
PathOutputCommitter implementations?
It's done in the S3A one, not done for FileOutputCommitter. The IBM Stocator
committer also does a JSON manifest, just a different one (I don't know
the details). We explicitly stuck a version marker on the one the S3A committer
currently uses so as to allow for change; that is, the deserialization code will
fail if the marker is missing or has the wrong version.
FWIW, I do parse the file in my Spark tests. Originally I had my own copy &
paste of the file format; now I just import the S3A one.
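
A minimal sketch of the kind of version check described above, for readers who want to see the idea in code. It is not the S3A committer's actual deserialization code: the class name, the "name" field, and the marker value are hypothetical, and a real reader would use a proper JSON library rather than a regular expression.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SuccessMarkerCheck {
    // Hypothetical version marker; NOT the value used by the real S3A committer.
    static final String EXPECTED_NAME = "example.SuccessManifest/1";

    // Pulls a string-valued "name" field out of a flat JSON object.
    // Sketch only: a real implementation would use a JSON parser.
    private static final Pattern NAME_FIELD =
        Pattern.compile("\"name\"\\s*:\\s*\"([^\"]*)\"");

    // Returns the marker if it matches the expected version, otherwise throws.
    static String validate(String json) {
        if (json == null || json.trim().isEmpty()) {
            throw new IllegalArgumentException("empty _SUCCESS file: no manifest to read");
        }
        Matcher m = NAME_FIELD.matcher(json);
        if (!m.find()) {
            throw new IllegalArgumentException("no version marker in manifest");
        }
        String name = m.group(1);
        if (!EXPECTED_NAME.equals(name)) {
            throw new IllegalArgumentException("unsupported manifest version: " + name);
        }
        return name;
    }

    public static void main(String[] args) {
        String manifest = "{\"name\": \"example.SuccessManifest/1\", \"committer\": \"demo\"}";
        System.out.println("accepted marker: " + validate(manifest));
    }
}
```

Failing fast on a missing or unexpected marker means a later change to the manifest format shows up as an explicit error instead of a silently misread file.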
> Add support for using Hadoop's S3A OutputCommitter
> --------------------------------------------------
>
> Key: HIVE-16295
> URL: https://issues.apache.org/jira/browse/HIVE-16295
> Project: Hive
> Issue Type: Sub-task
> Reporter: Sahil Takiar
> Assignee: Sahil Takiar
> Priority: Major
> Attachments: HIVE-16295.1.WIP.patch, HIVE-16295.2.WIP.patch,
> HIVE-16295.3.WIP.patch, HIVE-16295.4.patch, HIVE-16295.5.patch,
> HIVE-16295.6.patch, HIVE-16295.7.patch
>
>
> Hive doesn't have integration with Hadoop's {{OutputCommitter}}; it uses a
> {{NullOutputCommitter}} and uses its own commit logic spread across
> {{FileSinkOperator}}, {{MoveTask}}, and {{Hive}}.
> The Hadoop community is building an {{OutputCommitter}} that integrates with
> S3Guard and does a safe, coordinated commit of data on S3 inside individual
> tasks (HADOOP-13786). If Hive can integrate with this new {{OutputCommitter}}
> there would be a lot of benefits to Hive-on-S3:
> * Data is only written once; directly committing data at a task level means
> no renames are necessary
> * The commit is done safely, in a coordinated manner; duplicate tasks (from
> task retries or speculative execution) should not step on each other