[jira] [Commented] (HIVE-16295) Add support for using Hadoop's S3A OutputCommitter

Steve Loughran (JIRA) Wed, 25 Apr 2018 10:04:36 -0700

    [ https://issues.apache.org/jira/browse/HIVE-16295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16452645#comment-16452645 ]


Steve Loughran commented on HIVE-16295:
---------------------------------------

bq. is there a reason PathOutputCommitterFactory doesn't provide a way to 
construct a PathOutputCommitter using a JobContext rather than a 
TaskAttemptContext

I think it's because the only places in Hadoop & Spark where committers were 
being constructed with a JobContext alone were in the v1 committers, which these 
committers don't (currently) support. It just kept things simpler all round to 
not have to worry about two similar-but-slightly-different constructors.

bq. does the DirectoryOutputCommitter work with Spark SQL or just Spark? I'

It should work as a drop-in replacement for a normal Hadoop FileOutputCommitter; 
it's not being clever the way the partitioned one is.

Regarding dynamic partitioning, the S3A committers do know which files they've 
created, and that list goes into the manifest. If you load the _SUCCESS 
file and read that section, you can infer which partitions were written. If that 
works, then create a Hadoop JIRA "stabilize _SUCCESS format" and we'll think 
about what we can say "will always be retained". 

Or is this file being created too late in your workflow?
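
As a rough sketch of the _SUCCESS idea above: assuming the file is JSON and 
lists the committed files under a "filenames" key (an assumption here, since 
the format is not yet declared stable), the partitions could be inferred by 
grouping those paths by parent directory. The function name, table root and 
sample paths below are hypothetical, for illustration only.

```python
import json
from collections import defaultdict
from posixpath import dirname


def partitions_from_success(success_json: str, table_root: str) -> dict:
    """Group committed files by partition directory, relative to table_root."""
    # A zero-byte _SUCCESS marker carries no manifest at all.
    if not success_json.strip():
        return {}
    manifest = json.loads(success_json)
    # ASSUMPTION: committed files are listed under a "filenames" key.
    files = manifest.get("filenames", [])
    root = table_root.rstrip("/") + "/"
    by_partition = defaultdict(list)
    for path in files:
        if not path.startswith(root):
            continue  # skip anything outside the table directory
        relative = path[len(root):]
        by_partition[dirname(relative)].append(path)
    return dict(by_partition)


# Hypothetical manifest contents, not real committer output.
example = json.dumps({
    "committer": "directory",
    "filenames": [
        "/warehouse/t/year=2018/month=04/part-0000.orc",
        "/warehouse/t/year=2018/month=04/part-0001.orc",
        "/warehouse/t/year=2018/month=03/part-0000.orc",
    ],
})
result = partitions_from_success(example, "/warehouse/t")
```

This only helps if the _SUCCESS file has already been written by the time the 
partition list is needed.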

> Add support for using Hadoop's S3A OutputCommitter
> --------------------------------------------------
>
>                 Key: HIVE-16295
>                 URL: https://issues.apache.org/jira/browse/HIVE-16295
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Major
>         Attachments: HIVE-16295.1.WIP.patch, HIVE-16295.2.WIP.patch
>
>
> Hive doesn't integrate with Hadoop's {{OutputCommitter}}; it uses a 
> {{NullOutputCommitter}} and its own commit logic spread across 
> {{FileSinkOperator}}, {{MoveTask}}, and {{Hive}}.
> The Hadoop community is building an {{OutputCommitter}} that integrates with 
> S3Guard and does a safe, coordinated commit of data on S3 inside individual 
> tasks (HADOOP-13786). If Hive can integrate with this new {{OutputCommitter}} 
> there would be a lot of benefits to Hive-on-S3:
> * Data is only written once; directly committing data at a task level means 
> no renames are necessary
> * The commit is done safely, in a coordinated manner; duplicate tasks (from 
> task retries or speculative execution) should not step on each other



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
