[
https://issues.apache.org/jira/browse/HIVE-17082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gopal V resolved HIVE-17082.
----------------------------
Resolution: Not A Problem
[~rajesh.balamohan]: the semi-join can be removed in this case, because there
is no shuffle between the map-join and the semi-join operators.
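
For readers comparing the two plans quoted below, here is a minimal sketch of what "parallel edge to map join" looks like in the DAG. It is an illustration only, not Hive's TezCompiler::removeSemijoinsParallelToMapJoin logic: the edge list is transcribed by hand from the second EXPLAIN output, and the function name parallel_semijoin_paths and the hand-picked set of semijoin vertices are assumptions made for this example.

```python
# Edges transcribed from the second EXPLAIN plan: (source, target, edge_type).
EDGES_WITH_SEMIJOIN = [
    ("Map 3", "Map 1", "BROADCAST_EDGE"),         # map-join build side
    ("Reducer 4", "Map 1", "BROADCAST_EDGE"),     # min/max/bloom filter delivery
    ("Map 1", "Reducer 2", "CUSTOM_SIMPLE_EDGE"),
    ("Map 3", "Reducer 4", "CUSTOM_SIMPLE_EDGE"),
]

# Edges transcribed from the first plan (Hive master), where the semijoin is removed.
EDGES_WITHOUT_SEMIJOIN = [
    ("Map 3", "Map 1", "BROADCAST_EDGE"),
    ("Map 1", "Reducer 2", "CUSTOM_SIMPLE_EDGE"),
]


def parallel_semijoin_paths(edges, semijoin_vertices):
    """Return (source, target, via) triples where `source` feeds `target`
    directly and also through the semijoin vertex `via`.

    Hypothetical helper: `semijoin_vertices` is supplied by hand here
    (the vertex that finalizes min/max/bloom_filter)."""
    direct = {(src, dst) for src, dst, _ in edges}
    found = []
    for src, via, _ in edges:
        if via not in semijoin_vertices:
            continue
        for mid, dst, _ in edges:
            if mid == via and (src, dst) in direct:
                found.append((src, dst, via))
    return found


print(parallel_semijoin_paths(EDGES_WITH_SEMIJOIN, {"Reducer 4"}))
# [('Map 3', 'Map 1', 'Reducer 4')]
print(parallel_semijoin_paths(EDGES_WITHOUT_SEMIJOIN, set()))
# []
```

In the second plan, Map 3 reaches Map 1 twice: once directly as the map-join build side and once through Reducer 4 with the bloom filter, and neither path involves a shuffle of the store_sales rows.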
> Dynamic semi join gets turned off at compile time
> -------------------------------------------------
>
> Key: HIVE-17082
> URL: https://issues.apache.org/jira/browse/HIVE-17082
> Project: Hive
> Issue Type: Bug
> Reporter: Rajesh Balamohan
>
> With Hive-master:
> =================
> {noformat}
> 2017-07-13T08:35:55,042 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main]
> optimizer.DynamicPartitionPruningOptimization: Initiate semijoin reduction
> for sr_ticket_number ((sr_ticket_number is not null and (sr_ticket_number) IN
> (RS[6]))
> 2017-07-13T08:35:55,043 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main]
> optimizer.DynamicPartitionPruningOptimization: DynamicSemiJoinPushdown:
> Saving RS to TS mapping: RS[28]: TS[3]
> 2017-07-13T08:35:55,398 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main]
> optimizer.ConvertJoinMapJoin: Found semijoin optimization from the big table
> side of a map join, which will cause a task cycle. Removing semijoin RS[28] -
> TS[3] (store_returns)
> 2017-07-13T08:35:55,400 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main]
> parse.TezCompiler: Computing key domain cardinality,
> keyDomainCardinality=95121413, semiJoinKeyIsPK=false, selColStat= colName:
> _col0 colType: bigint countDistincts: 8362530 numNulls: 0 avgColLen: 8.0
> numTrues: 0 numFalses: 0 Range: [ min: 1 max: 240000000 ] isPrimaryKey:
> false, selColSourceStat= colName: sr_ticket_number colType: bigint
> countDistincts: 8362530 numNulls: 0 avgColLen: 8.0 numTrues: 0 numFalses: 0
> Range: [ min: 1 max: 240000000 ] isPrimaryKey: false, tsColStat= colName:
> ss_ticket_number colType: bigint countDistincts: 86758883 numNulls: 0
> avgColLen: 8.0 numTrues: 0 numFalses: 0 Range: [ min: 1 max: 240000000 ]
> isPrimaryKey: false
> 2017-07-13T08:35:55,400 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main]
> parse.TezCompiler: SemiJoin key selectivity=0.08791427436007496,
> benefit=2.6267959439021907E9
> 2017-07-13T08:35:55,400 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main]
> parse.TezCompiler: BloomFilter benefit=2.6267959439021907E9,
> cost=2.87999764E8, tsDataSize=2879987999, netBenefit=2.3387961799021907E9
> 2017-07-13T08:35:55,400 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main]
> parse.TezCompiler: netBenefit=0.8120853908815856
> 2017-07-13T08:35:55,400 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main]
> parse.TezCompiler: Semijoin optimization with parallel edge to map join.
> Removing semijoin RS[23] - TS[0] (store_sales)
> > explain select count(1) from store_sales, store_returns where
> > sr_ticket_number = ss_ticket_number;
> OK
> STAGE DEPENDENCIES:
> Stage-1 is a root stage
> Stage-0 depends on stages: Stage-1
> STAGE PLANS:
> Stage: Stage-1
> Tez
> DagId: rbalamohan_20170713083602_0ed509c0-0311-480e-a01c-bafcb259a5fe:3
> Edges:
> Map 1 <- Map 3 (BROADCAST_EDGE)
> Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
> DagName:
> Vertices:
> Map 1
> Map Operator Tree:
> TableScan
> alias: store_sales
> filterExpr: ss_ticket_number is not null (type: boolean)
> Statistics: Num rows: 2879987999 Data size: 23039903992
> Basic stats: COMPLETE Column stats: COMPLETE
> Filter Operator
> predicate: ss_ticket_number is not null (type: boolean)
> Statistics: Num rows: 2879987999 Data size: 23039903992
> Basic stats: COMPLETE Column stats: COMPLETE
> Select Operator
> expressions: ss_ticket_number (type: bigint)
> outputColumnNames: _col0
> Statistics: Num rows: 2879987999 Data size: 23039903992
> Basic stats: COMPLETE Column stats: COMPLETE
> Map Join Operator
> condition map:
> Inner Join 0 to 1
> keys:
> 0 _col0 (type: bigint)
> 1 _col0 (type: bigint)
> input vertices:
> 1 Map 3
> Statistics: Num rows: 9560241388 Data size:
> 76481931104 Basic stats: COMPLETE Column stats: COMPLETE
> Group By Operator
> aggregations: count()
> mode: hash
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 8 Basic stats:
> COMPLETE Column stats: COMPLETE
> Reduce Output Operator
> sort order:
> Statistics: Num rows: 1 Data size: 8 Basic stats:
> COMPLETE Column stats: COMPLETE
> value expressions: _col0 (type: bigint)
> Execution mode: vectorized, llap
> Map 3
> Map Operator Tree:
> TableScan
> alias: store_returns
> filterExpr: sr_ticket_number is not null (type: boolean)
> Statistics: Num rows: 287999764 Data size: 2303998112 Basic
> stats: COMPLETE Column stats: COMPLETE
> Filter Operator
> predicate: sr_ticket_number is not null (type: boolean)
> Statistics: Num rows: 287999764 Data size: 2303998112
> Basic stats: COMPLETE Column stats: COMPLETE
> Select Operator
> expressions: sr_ticket_number (type: bigint)
> outputColumnNames: _col0
> Statistics: Num rows: 287999764 Data size: 2303998112
> Basic stats: COMPLETE Column stats: COMPLETE
> Reduce Output Operator
> key expressions: _col0 (type: bigint)
> sort order: +
> Map-reduce partition columns: _col0 (type: bigint)
> Statistics: Num rows: 287999764 Data size: 2303998112
> Basic stats: COMPLETE Column stats: COMPLETE
> Execution mode: vectorized, llap
> Reducer 2
> Execution mode: vectorized, llap
> Reduce Operator Tree:
> Group By Operator
> aggregations: count(VALUE._col0)
> mode: mergepartial
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
> Column stats: COMPLETE
> File Output Operator
> compressed: false
> Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
> Column stats: COMPLETE
> table:
> input format:
> org.apache.hadoop.mapred.SequenceFileInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> serde:
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> Stage: Stage-0
> Fetch Operator
> limit: -1
> Processor Tree:
> ListSink
> {noformat}
> Without TezCompiler::removeSemijoinsParallelToMapJoin:
> ======================================================
> With that pass skipped, the semijoin reduction stays in the plan:
> {noformat}
> > explain select count(1) from store_sales, store_returns where
> sr_ticket_number = ss_ticket_number;
> OK
> STAGE DEPENDENCIES:
> Stage-1 is a root stage
> Stage-0 depends on stages: Stage-1
> STAGE PLANS:
> Stage: Stage-1
> Tez
> DagId: rbalamohan_20170713082329_4c868b9a-6113-4da8-8c9a-66d9018e45c0:6
> Edges:
> Map 1 <- Map 3 (BROADCAST_EDGE), Reducer 4 (BROADCAST_EDGE)
> Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
> Reducer 4 <- Map 3 (CUSTOM_SIMPLE_EDGE)
> DagName:
> Vertices:
> Map 1
> Map Operator Tree:
> TableScan
> alias: store_sales
> filterExpr: (ss_ticket_number is not null and
> (ss_ticket_number BETWEEN
> DynamicValue(RS_7_store_returns_sr_ticket_number_min) AND
> DynamicValue(RS_7_store_returns_sr_ticket_number_max) and
> in_bloom_filter(ss_ticket_number,
> DynamicValue(RS_7_store_returns_sr_ticket_number_bloom_filter)))) (type:
> boolean)
> Statistics: Num rows: 2879987999 Data size: 23039903992
> Basic stats: COMPLETE Column stats: COMPLETE
> Filter Operator
> predicate: (ss_ticket_number is not null and
> (ss_ticket_number BETWEEN
> DynamicValue(RS_7_store_returns_sr_ticket_number_min) AND
> DynamicValue(RS_7_store_returns_sr_ticket_number_max) and
> in_bloom_filter(ss_ticket_number,
> DynamicValue(RS_7_store_returns_sr_ticket_number_bloom_filter)))) (type:
> boolean)
> Statistics: Num rows: 2879987999 Data size: 23039903992
> Basic stats: COMPLETE Column stats: COMPLETE
> Select Operator
> expressions: ss_ticket_number (type: bigint)
> outputColumnNames: _col0
> Statistics: Num rows: 2879987999 Data size: 23039903992
> Basic stats: COMPLETE Column stats: COMPLETE
> Map Join Operator
> condition map:
> Inner Join 0 to 1
> keys:
> 0 _col0 (type: bigint)
> 1 _col0 (type: bigint)
> input vertices:
> 1 Map 3
> Statistics: Num rows: 9560241388 Data size:
> 76481931104 Basic stats: COMPLETE Column stats: COMPLETE
> Group By Operator
> aggregations: count()
> mode: hash
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 8 Basic stats:
> COMPLETE Column stats: COMPLETE
> Reduce Output Operator
> sort order:
> Statistics: Num rows: 1 Data size: 8 Basic stats:
> COMPLETE Column stats: COMPLETE
> value expressions: _col0 (type: bigint)
> Execution mode: vectorized, llap
> Map 3
> Map Operator Tree:
> TableScan
> alias: store_returns
> filterExpr: sr_ticket_number is not null (type: boolean)
> Statistics: Num rows: 287999764 Data size: 2303998112 Basic
> stats: COMPLETE Column stats: COMPLETE
> Filter Operator
> predicate: sr_ticket_number is not null (type: boolean)
> Statistics: Num rows: 287999764 Data size: 2303998112
> Basic stats: COMPLETE Column stats: COMPLETE
> Select Operator
> expressions: sr_ticket_number (type: bigint)
> outputColumnNames: _col0
> Statistics: Num rows: 287999764 Data size: 2303998112
> Basic stats: COMPLETE Column stats: COMPLETE
> Reduce Output Operator
> key expressions: _col0 (type: bigint)
> sort order: +
> Map-reduce partition columns: _col0 (type: bigint)
> Statistics: Num rows: 287999764 Data size: 2303998112
> Basic stats: COMPLETE Column stats: COMPLETE
> Select Operator
> expressions: _col0 (type: bigint)
> outputColumnNames: _col0
> Statistics: Num rows: 287999764 Data size: 2303998112
> Basic stats: COMPLETE Column stats: COMPLETE
> Group By Operator
> aggregations: min(_col0), max(_col0),
> bloom_filter(_col0, expectedEntries=16725060)
> mode: hash
> outputColumnNames: _col0, _col1, _col2
> Statistics: Num rows: 1 Data size: 24 Basic stats:
> COMPLETE Column stats: COMPLETE
> Reduce Output Operator
> sort order:
> Statistics: Num rows: 1 Data size: 24 Basic
> stats: COMPLETE Column stats: COMPLETE
> value expressions: _col0 (type: bigint), _col1
> (type: bigint), _col2 (type: binary)
> Execution mode: vectorized, llap
> Reducer 2
> Execution mode: vectorized, llap
> Reduce Operator Tree:
> Group By Operator
> aggregations: count(VALUE._col0)
> mode: mergepartial
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
> Column stats: COMPLETE
> File Output Operator
> compressed: false
> Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
> Column stats: COMPLETE
> table:
> input format:
> org.apache.hadoop.mapred.SequenceFileInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> serde:
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> Reducer 4
> Execution mode: vectorized, llap
> Reduce Operator Tree:
> Group By Operator
> aggregations: min(VALUE._col0), max(VALUE._col1),
> bloom_filter(VALUE._col2, expectedEntries=16725060)
> mode: final
> outputColumnNames: _col0, _col1, _col2
> Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE
> Column stats: COMPLETE
> Reduce Output Operator
> sort order:
> Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE
> Column stats: COMPLETE
> value expressions: _col0 (type: bigint), _col1 (type:
> bigint), _col2 (type: binary)
> Stage: Stage-0
> Fetch Operator
> limit: -1
> Processor Tree:
> ListSink
> {noformat}
> Related ticket: HIVE-16260
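
The benefit and cost figures in the TezCompiler DEBUG lines quoted above can be reproduced with simple arithmetic. The sketch below is a reconstruction under stated assumptions: the inputs are copied from the log, but the formulas are inferred from how the logged numbers relate to each other, not taken from the Hive source, and the function name semijoin_estimates is made up for this example.

```python
import math


def semijoin_estimates(sel_ndv, key_domain_cardinality, ts_data_size, cost):
    """Recompute the logged semijoin figures.

    Assumed relationships (inferred from the log, not from Hive code):
      selectivity = distinct keys on the selecting side / key domain cardinality
      benefit     = tsDataSize * (1 - selectivity)
      netBenefit  = benefit - cost, then normalized by tsDataSize
    """
    selectivity = sel_ndv / key_domain_cardinality
    benefit = ts_data_size * (1.0 - selectivity)
    net_benefit = benefit - cost
    return {
        "selectivity": selectivity,
        "benefit": benefit,
        "net_benefit": net_benefit,
        "normalized_net_benefit": net_benefit / ts_data_size,
    }


# Inputs copied from the quoted log lines.
estimates = semijoin_estimates(
    sel_ndv=8362530,                  # countDistincts of sr_ticket_number
    key_domain_cardinality=95121413,  # keyDomainCardinality
    ts_data_size=2879987999,          # tsDataSize (store_sales row count)
    cost=287999764,                   # cost; equals the store_returns row count in the plan
)

for name, value in estimates.items():
    print(f"{name} = {value}")

# Compare against the values printed by TezCompiler in the log.
assert math.isclose(estimates["selectivity"], 0.08791427436007496, rel_tol=1e-6)
assert math.isclose(estimates["normalized_net_benefit"], 0.8120853908815856, rel_tol=1e-6)
```

Under these assumptions the estimated net benefit is positive (about 0.81 of the table scan size), so the semijoin in the log is dropped by the "parallel edge to map join" rule rather than by the cost check.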