[ 
https://issues.apache.org/jira/browse/TEZ-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378093#comment-15378093
 ] 

Siddharth Seth commented on TEZ-3336:
-------------------------------------

These events are used for dynamic partition pruning.

A simple example: a join on date_ref between a master table and date_dim, 
where the query specifies a date range and the master table is partitioned by 
day/month.

Hive will generate its regular plan for a MapJoin with date_dim. Split 
generation for the master table is delayed until date_dim is done. Along with 
sending a map to the next stage, Hive sends these InputInitializerEvents, which 
contain the date_ref values that satisfy the filter. HiveSplitGenerator then 
uses these values to cut down the number of splits based on the partitioning 
scheme.
I believe this optimization resulted in good performance gains, and removed the 
requirement to re-write some queries to run them faster.
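
As a rough illustration of the pruning step only: the sketch below is not 
Hive's or Tez's actual code. The Split class, the prune_splits function and the 
paths are all hypothetical, and the set of values stands in for whatever the 
InputInitializerEvents would carry once date_dim has been filtered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    """Hypothetical input split tagged with the partition value it belongs to."""
    path: str
    date_ref: str  # partition key value, e.g. "2016-07-01"


def prune_splits(splits, surviving_date_refs):
    """Keep only the splits whose partition value survived the date_dim filter."""
    surviving = set(surviving_date_refs)
    return [s for s in splits if s.date_ref in surviving]


# Candidate splits for the master table, one partition per day (made-up paths).
all_splits = [
    Split("/warehouse/master/date_ref=2016-07-01/000000_0", "2016-07-01"),
    Split("/warehouse/master/date_ref=2016-07-02/000000_0", "2016-07-02"),
    Split("/warehouse/master/date_ref=2016-07-03/000000_0", "2016-07-03"),
    Split("/warehouse/master/date_ref=2016-07-04/000000_0", "2016-07-04"),
]

# Stand-in for the date_ref values delivered by the events.
values_from_events = {"2016-07-02", "2016-07-03"}

pruned = prune_splits(all_splits, values_from_events)
for split in pruned:
    print(split.path)
```

Only the partitions whose date_ref appears in the delivered values are read. 
This is also why split generation has to wait for date_dim, and why the split 
generator must be prepared to receive events.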

Hive should either use HiveSplitGenerator even with CombineHiveInputFormat, or 
disable this optimization in that case. (HiveSplitGenerator has its own logic 
for schema evolution etc., similar to CombineHiveInputFormat.)

> Hive map-side join job sometimes fails with ROOT_INPUT_INIT_FAILURE
> -------------------------------------------------------------------
>
>                 Key: TEZ-3336
>                 URL: https://issues.apache.org/jira/browse/TEZ-3336
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.1
>            Reporter: Jason Lowe
>
> When Hive does a map-side join it can generate a DAG where a vertex has two 
> inputs, one from an upstream task and another using MRInputAMSplitGenerator.  
> If it takes a while for MRInputAMSplitGenerator to compute the splits and one 
> of the tasks for the other upstream vertex completes then the job can fail 
> with an error since MRInputAMSplitGenerator does not expect to receive any 
> events.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
