[ 
https://issues.apache.org/jira/browse/HIVE-16295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16452628#comment-16452628
 ] 

Sahil Takiar commented on HIVE-16295:
-------------------------------------

[[email protected]] yes, I've been planning to hook into the committer 
factory at some point, just haven't gotten around to it yet. Yes, I've been 
using the _SUCCESS file; it's been very useful.

One question I did have about the committer-factories. Is there a reason 
{{PathOutputCommitterFactory}} doesn't provide a way to construct a 
{{PathOutputCommitter}} using a {{JobContext}} rather than a 
{{TaskAttemptContext}}? Right now, in HiveServer2 (aka Hive's App Master), I 
create the {{PathOutputCommitter}} using a {{TaskAttemptContext}}, which feels 
a bit odd since it's not really a task, right? Plus, the Javadocs for 
{{PathOutputCommitter}} say that subclasses should provide a public constructor 
with the following signature {{#(Path outputPath, JobContext context)}}.

Another question, does the {{DirectoryOutputCommitter}} work with Spark SQL or 
just Spark? I'm hitting some issues with Dynamic Partitioning queries in Hive 
(see below) and am curious how Spark SQL handles this.

--

Quick update on my progress. Bad news is that getting things to work for 
dynamic partitioning is going to be much more effort than I thought, so I'm 
thinking of doing it in a separate JIRA. There are a number of issues, but the 
biggest one is described below:

* Dynamic partitioning 
(https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions) allows a 
query to dynamically create partitions at runtime, which means that the final 
output directories for a MR / Spark job aren't known until after the tasks have 
run
* Hive's commit logic requires collecting and registering all modified 
partitions that have been created (or updated) during the lifetime of a query
** For non-DP queries, this is simple, because at compile time you know exactly 
which partitions you are modifying
** For DP queries, you have no way of knowing which partitions have been 
created / updated until after the MR / Spark job has completed
*** So in order to find all updated partitions, Hive does a recursive 
listing on the tmp directory that stores all the intermediate data for a MR / 
Spark job
*** Once it finds all the updated partitions, it explicitly registers them in 
HMS, collects some stats, emits lineage info, etc.
* The issue is that with the S3A Output Committers there is no longer a tmp 
path to recursively iterate through, this means Hive has no way of knowing what 
new partitions have been created (or which partitions have been overwritten)
** Iterating through the final output directory (e.g. the table dir) doesn't 
work because you can't differentiate between updated partitions and non-updated 
ones
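To make the listing step concrete, here is a minimal, self-contained sketch (plain `java.nio.file`, not Hive's actual code) of how walking the tmp directory recovers the dynamically created partitions: any subdirectory of the form `col=value` that holds data files is treated as a partition spec. The class and method names are illustrative.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

/**
 * Illustrative sketch: discover dynamically created partitions by
 * recursively listing a scratch directory, the way Hive's commit logic
 * walks the MR / Spark tmp path after the job completes.
 */
public class PartitionDiscovery {

    /**
     * Recursively list tmpDir and collect the relative paths of
     * directories that contain data files and look like partition
     * specs (i.e. their path contains "col=value" components).
     */
    public static SortedSet<String> findPartitions(Path tmpDir) throws IOException {
        try (Stream<Path> files = Files.walk(tmpDir)) {
            return files
                .filter(Files::isRegularFile)
                .map(f -> tmpDir.relativize(f.getParent()).toString())
                .filter(rel -> rel.contains("="))   // only partitioned subdirs
                .collect(Collectors.toCollection(TreeSet::new));
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a fake tmp dir with two dynamic partitions.
        Path tmp = Files.createTempDirectory("hive-tmp");
        Files.createDirectories(tmp.resolve("ds=2018-04-01/country=us"));
        Files.createDirectories(tmp.resolve("ds=2018-04-02/country=uk"));
        Files.write(tmp.resolve("ds=2018-04-01/country=us/000000_0"), new byte[0]);
        Files.write(tmp.resolve("ds=2018-04-02/country=uk/000000_0"), new byte[0]);

        System.out.println(findPartitions(tmp));
    }
}
```

With the S3A committers this exact step breaks: there is no tmpDir to walk, which is the problem described above.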

The best approach I can think of is for each Hive task to create a new manifest 
file tracking all the partitions it has written. We do similar things for stats 
collection + Hive-ACID.
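A hypothetical sketch of that manifest approach (names and file layout are mine, not Hive's): each task writes a small manifest listing the partitions it created, and the driver merges all manifests at commit time instead of listing a tmp directory.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.Stream;

/**
 * Hypothetical sketch of the proposed manifest approach for dynamic
 * partitioning with the S3A committers. One manifest file per task
 * attempt, one partition spec per line.
 */
public class PartitionManifest {

    /** Called from a task: record the partitions this task wrote. */
    public static void writeManifest(Path manifestDir, String taskId,
                                     Collection<String> partitions) throws IOException {
        Files.createDirectories(manifestDir);
        Files.write(manifestDir.resolve(taskId + ".manifest"), partitions);
    }

    /** Called from the driver at commit: union of all tasks' partitions. */
    public static SortedSet<String> mergeManifests(Path manifestDir) throws IOException {
        SortedSet<String> all = new TreeSet<>();
        try (Stream<Path> manifests = Files.list(manifestDir)) {
            for (Path m : (Iterable<Path>) manifests::iterator) {
                all.addAll(Files.readAllLines(m));
            }
        }
        return all;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("manifests");
        writeManifest(dir, "task_0001", Arrays.asList("ds=2018-04-01/country=us"));
        writeManifest(dir, "task_0002", Arrays.asList("ds=2018-04-01/country=us",
                                                      "ds=2018-04-02/country=uk"));
        // duplicates collapse; the driver registers each partition once in HMS
        System.out.println(mergeManifests(dir));
    }
}
```

The set union also handles duplicate task attempts (retries / speculation), since identical manifest lines collapse.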

> Add support for using Hadoop's S3A OutputCommitter
> --------------------------------------------------
>
>                 Key: HIVE-16295
>                 URL: https://issues.apache.org/jira/browse/HIVE-16295
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Major
>         Attachments: HIVE-16295.1.WIP.patch, HIVE-16295.2.WIP.patch
>
>
> Hive doesn't integrate with Hadoop's {{OutputCommitter}}; it uses a 
> {{NullOutputCommitter}} and its own commit logic spread across 
> {{FileSinkOperator}}, {{MoveTask}}, and {{Hive}}.
> The Hadoop community is building an {{OutputCommitter}} that integrates with 
> S3Guard and does a safe, coordinated commit of data on S3 inside individual 
> tasks (HADOOP-13786). If Hive can integrate with this new {{OutputCommitter}} 
> there would be a lot of benefits to Hive-on-S3:
> * Data is only written once; directly committing data at a task level means 
> no renames are necessary
> * The commit is done safely, in a coordinated manner; duplicate tasks (from 
> task retries or speculative execution) should not step on each other



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)