[jira] [Commented] (HIVE-16295) Add support for using Hadoop's S3A OutputCommitter

Sahil Takiar (JIRA) Tue, 01 May 2018 14:34:38 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-16295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16460183#comment-16460183
 ]


Sahil Takiar commented on HIVE-16295:
-------------------------------------

Attached updated patch.

For the _SUCCESS file, is it something that is common to all 
{{PathOutputCommitter}} implementations? I would prefer a solution that works 
for all committers rather than just for the S3A one. It might be best for 
Hive to create its own manifest file; I don't think that would be too 
difficult to add. However, I've reduced the scope of this patch to focus 
only on non-dynamic partitioning queries, so we can continue the 
discussion of the _SUCCESS file in HIVE-19321.

[[email protected]] does {{PathOutputCommitterFactory}} need to be a private 
API (it's marked as {{InterfaceAudience.Private}})? In my updated patch, I'm 
using it to create the {{PathOutputCommitter}} specified by 
{{mapreduce.outputcommitter.factory.scheme.[uri-scheme]}}.
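
To illustrate the per-scheme lookup mentioned above, here is a minimal sketch of how a key of the form {{mapreduce.outputcommitter.factory.scheme.[uri-scheme]}} could be derived from an output location and resolved. It is not the code in the patch and not Hadoop's implementation: the class {{CommitterFactoryKeySketch}}, its methods, the plain {{Map}} standing in for a Hadoop configuration, and the factory class name {{com.example.ExampleCommitterFactory}} are all hypothetical.

```java
import java.net.URI;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper for illustration only; not part of Hive or Hadoop. */
final class CommitterFactoryKeySketch {
    // Key prefix taken from the comment above; the URI scheme is appended to it.
    static final String SCHEME_KEY_PREFIX = "mapreduce.outputcommitter.factory.scheme.";

    /** Builds the per-scheme key for an output location. */
    static String keyFor(URI outputLocation) {
        String scheme = outputLocation.getScheme();
        if (scheme == null) {
            throw new IllegalArgumentException("Output location has no scheme: " + outputLocation);
        }
        return SCHEME_KEY_PREFIX + scheme;
    }

    /** Returns the factory class name configured for the location's scheme, if any. */
    static Optional<String> factoryClassFor(URI outputLocation, Map<String, String> conf) {
        // A plain Map stands in for the job configuration (an assumption of this sketch).
        return Optional.ofNullable(conf.get(keyFor(outputLocation)));
    }

    public static void main(String[] args) {
        // Hypothetical configuration: one factory registered for the "s3a" scheme.
        Map<String, String> conf =
            Map.of(SCHEME_KEY_PREFIX + "s3a", "com.example.ExampleCommitterFactory");

        URI s3Table = URI.create("s3a://bucket/warehouse/t1");
        URI hdfsTable = URI.create("hdfs://nn/warehouse/t1");

        System.out.println(keyFor(s3Table) + " -> " + factoryClassFor(s3Table, conf));
        System.out.println(keyFor(hdfsTable) + " -> " + factoryClassFor(hdfsTable, conf));
    }
}
```

In this sketch a location whose scheme has no configured entry simply yields an empty result, leaving the choice of fallback committer to the caller.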

Updates:
* Lots of code cleanup and refactoring
* Using the {{PathOutputCommitterFactory}} instead of using reflection
* Some bug fixes around naming of files
* Found a proper way to handle merge-file jobs

What's Next:
* Cleanup of logic in {{MoveTask}}
* Explicit qtests

> Add support for using Hadoop's S3A OutputCommitter
> --------------------------------------------------
>
>                 Key: HIVE-16295
>                 URL: https://issues.apache.org/jira/browse/HIVE-16295
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Major
>         Attachments: HIVE-16295.1.WIP.patch, HIVE-16295.2.WIP.patch, 
> HIVE-16295.3.WIP.patch
>
>
> Hive doesn't have integration with Hadoop's {{OutputCommitter}}; it uses a 
> {{NullOutputCommitter}} and its own commit logic spread across 
> {{FileSinkOperator}}, {{MoveTask}}, and {{Hive}}.
> The Hadoop community is building an {{OutputCommitter}} that integrates with 
> S3Guard and does a safe, coordinated commit of data on S3 inside individual 
> tasks (HADOOP-13786). If Hive can integrate with this new {{OutputCommitter}}, 
> there would be a lot of benefits to Hive-on-S3:
> * Data is only written once; directly committing data at a task level means 
> no renames are necessary
> * The commit is done safely, in a coordinated manner; duplicate tasks (from 
> task retries or speculative execution) should not step on each other



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
