[
https://issues.apache.org/jira/browse/TEZ-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378093#comment-15378093
]
Siddharth Seth commented on TEZ-3336:
-------------------------------------
These events are used for dynamic partition pruning.
As a simple example, consider a join on date_ref between a master table and
date_dim, with the query specifying a date range and the master table
partitioned by day/month.
Hive will generate its regular plan for a MapJoin with date_dim. Splits for
the master table are delayed till date_dim is done. Along with sending a map to
the next stage, Hive sends these InputInitializerEvents which contain the
date_ref values which satisfy the filter. HiveSplitGenerator then uses these to
cut down the number of splits based on the partitioning scheme.
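To make the pruning step concrete, here is a minimal sketch in Python. It is
not Hive's actual code: the Split class, the prune_splits function, and the
example paths are all hypothetical, and it assumes each split is tagged with
the value of its partition column and that the events simply carry the
date_ref values that survived the filter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    """Hypothetical stand-in for one input split of the master table."""
    path: str
    partition_value: str  # value of the partition column, e.g. the day


def prune_splits(splits, event_values):
    """Keep only the splits whose partition value was reported by an event.

    `event_values` plays the role of the date_ref values carried by the
    InputInitializerEvents; the real payload format is not modelled here.
    """
    allowed = set(event_values)
    return [s for s in splits if s.partition_value in allowed]


# Hypothetical master table partitioned by day.
all_splits = [
    Split("/warehouse/master/day=2016-07-01/000000_0", "2016-07-01"),
    Split("/warehouse/master/day=2016-07-02/000000_0", "2016-07-02"),
    Split("/warehouse/master/day=2016-07-03/000000_0", "2016-07-03"),
]

# Values that the date_dim side found to satisfy the query's date range.
values_from_events = ["2016-07-02", "2016-07-03"]

kept = prune_splits(all_splits, values_from_events)
for split in kept:
    print(split.path)
```

The point of the sketch is only that split generation for the master table
has to wait for the events, since the set of allowed partition values is not
known until date_dim has been processed.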
I believe this optimization resulted in good performance gains, and removed the
requirement to re-write some queries to run them faster.
Hive should either use HiveSplitGenerator even with CombineHiveInputFormat, or
disable this optimization in that case. (HiveSplitGenerator has its own logic
for schema evolution etc., similar to CombineHiveInputFormat.)
> Hive map-side join job sometimes fails with ROOT_INPUT_INIT_FAILURE
> -------------------------------------------------------------------
>
> Key: TEZ-3336
> URL: https://issues.apache.org/jira/browse/TEZ-3336
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.7.1
> Reporter: Jason Lowe
>
> When Hive does a map-side join it can generate a DAG where a vertex has two
> inputs, one from an upstream task and another using MRInputAMSplitGenerator.
> If it takes a while for MRInputAMSplitGenerator to compute the splits and one
> of the tasks for the other upstream vertex completes then the job can fail
> with an error since MRInputAMSplitGenerator does not expect to receive any
> events.