[ 
https://issues.apache.org/jira/browse/HIVE-3477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458017#comment-13458017
 ] 

Ashutosh Chauhan commented on HIVE-3477:
----------------------------------------

In general, I think we should start adopting the {{OutputCommitter}} functionality 
provided by the MR framework instead of maintaining our own {{JobClose}} operator. 
Whatever logic we want to run at the end of the query (currently done via 
{{JobClose}}) can run in {{commitJob}} of the last MR job for the query. 
This frees us from having to keep a live Hive client around for long-running 
queries. It also stops us from replicating functionality that the underlying 
framework already provides, and it may help us avoid bugs like this one.
This is definitely not in scope for this bug, but I wanted to bring it up because, 
while fixing this issue, you might gain more insight into why this may or may 
not be a good idea.
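
To make the suggestion concrete, here is a rough, self-contained sketch of end-of-query work (moving files from a tmp dir into the final partition dir) running inside a single commit callback. {{JobCommitHook}}, {{MoveOnCommit}}, {{CommitJobSketch}}, the file names and the partition dir are all made up for illustration. The interface is only a stand-in for the {{commitJob}} step and is not Hadoop's actual {{OutputCommitter}} class, and the driver method merely simulates the framework calling the hook once after all tasks are done.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical stand-in for the one callback of interest. This is NOT
// Hadoop's OutputCommitter, only a model of its commitJob step.
interface JobCommitHook {
    void commitJob(Path tmpDir, Path finalDir) throws IOException;
}

// Moves every file from the temporary directory into the final directory,
// which is roughly the kind of work done at the end of the query today.
class MoveOnCommit implements JobCommitHook {
    @Override
    public void commitJob(Path tmpDir, Path finalDir) throws IOException {
        Files.createDirectories(finalDir);
        List<Path> files;
        // Snapshot the listing first so we do not move files while iterating.
        try (Stream<Path> s = Files.list(tmpDir)) {
            files = s.collect(Collectors.toList());
        }
        for (Path f : files) {
            Files.move(f, finalDir.resolve(f.getFileName()),
                    StandardCopyOption.REPLACE_EXISTING);
        }
    }
}

class CommitJobSketch {
    // Simulates the framework: tasks write to tmp, then the hook runs once.
    // Returns the number of files that ended up in the final directory.
    static long runQuery() {
        try {
            Path root = Files.createTempDirectory("hive3477");
            Path tmp = Files.createDirectories(root.resolve("_tmp"));
            Path out = root.resolve("ds=2012-09-18"); // made-up partition dir
            Files.write(tmp.resolve("000000_0"), "row1\n".getBytes(StandardCharsets.UTF_8));
            Files.write(tmp.resolve("000001_0"), "row2\n".getBytes(StandardCharsets.UTF_8));
            new MoveOnCommit().commitJob(tmp, out);
            try (Stream<Path> s = Files.list(out)) {
                return s.count();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("files committed: " + runQuery());
    }
}
```

The point of the sketch is only that the move happens in one place, driven by the job itself rather than by a client that has to stay alive until the query ends.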
                
> Duplicate data possible with speculative execution for dynamic partitions
> -------------------------------------------------------------------------
>
>                 Key: HIVE-3477
>                 URL: https://issues.apache.org/jira/browse/HIVE-3477
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: hive.3477.1.patch
>
>
> Consider a query like:
> insert overwrite T partition (ds)
> select * from
> (mapreduce-subq1
>   union all
> mapreduce-subq2)x;
> Once mapreduce-subq1 and mapreduce-subq2 are done, the task for the union
> is invoked. At the end of the union task, jobClose is invoked.
> Note that there are 2 tablescan operators. The tree is something like:
> TABLESCAN1  --
>               \
>                UNION -- SELECT -- FILESINK
>               /
> TABLESCAN2  --
> In the current setup, jobClose will be invoked twice for FileSink.
> In case of speculative execution, it is possible that data is still
> being written to the tmp dir after jobClose has finished once.
> The correct fix would be to make sure that jobClose is only invoked once.
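
The double invocation in the quoted description can be modeled with a toy operator tree. The sketch below is only an illustration under assumptions: {{Op}}, {{JobCloseWalk}}, {{closeNaive}} and {{closeOnce}} are hypothetical names, not Hive's operator classes, and the visited-set guard is one possible way to get "only invoked once", not necessarily what the attached patch does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Toy operator node; counts how many times its close step was run.
class Op {
    final String name;
    final List<Op> children = new ArrayList<>();
    int jobCloseCalls = 0;

    Op(String name) { this.name = name; }

    Op child(Op c) { children.add(c); return this; }
}

class JobCloseWalk {
    // Naive walk: every path from a root reaches the shared sink, so the
    // sink is closed once per table scan.
    static void closeNaive(Op op) {
        op.jobCloseCalls++;
        for (Op c : op.children) closeNaive(c);
    }

    // Guarded walk: an identity-based visited set ensures each operator is
    // closed at most once, however many parents lead to it.
    static void closeOnce(Op op, Set<Op> seen) {
        if (!seen.add(op)) return;
        op.jobCloseCalls++;
        for (Op c : op.children) closeOnce(c, seen);
    }

    // Builds TABLESCAN1/TABLESCAN2 -> UNION -> SELECT -> FILESINK and
    // returns {roots, sink}.
    static Op[] buildTree() {
        Op sink = new Op("FILESINK");
        Op select = new Op("SELECT").child(sink);
        Op union = new Op("UNION").child(select);
        Op ts1 = new Op("TABLESCAN1").child(union);
        Op ts2 = new Op("TABLESCAN2").child(union);
        return new Op[] { ts1, ts2, sink };
    }

    // Returns the number of close calls the sink received under the naive
    // or the guarded walk.
    static int sinkCloseCalls(boolean guarded) {
        Op[] t = buildTree();
        Set<Op> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        for (int i = 0; i < 2; i++) {
            if (guarded) closeOnce(t[i], seen); else closeNaive(t[i]);
        }
        return t[2].jobCloseCalls;
    }

    public static void main(String[] args) {
        System.out.println("naive: " + sinkCloseCalls(false)
                + ", guarded: " + sinkCloseCalls(true));
        // prints "naive: 2, guarded: 1"
    }
}
```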

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
