[
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16221989#comment-16221989
]
liyunzhang commented on HIVE-17486:
-----------------------------------
mapjoin.q
{code}
set hive.mapred.mode=nonstrict;
set hive.explain.user=false;
set hive.auto.convert.join=true;
explain
select src1.key, src1.cnt1, src2.cnt1 from
(
select key, count(*) as cnt1 from
(
select a.key as key, a.value as val1, b.value as val2 from tbl1 a join tbl2
b on a.key = b.key
) subq1 group by key
) src1
join
(
select key, count(*) as cnt1 from
(
select a.key as key, a.value as val1, b.value as val2 from tbl1 a join tbl2
b on a.key = b.key
) subq2 group by key
) src2
on src1.key = src2.key
{code}
before shared work optimization, the physical plan in Tez
{code}
TS[0]-FIL[41]-SEL[2]-MAPJOIN[45]-GBY[10]-RS[11]-GBY[12]-MAPJOIN[47]-SEL[31]-FS[32]
TS[3]-FIL[42]-SEL[5]-RS[7]-MAPJOIN[45]
TS[14]-FIL[43]-SEL[16]-MAPJOIN[46]-GBY[24]-RS[25]-GBY[26]-RS[29]-MAPJOIN[47]
TS[17]-FIL[44]-SEL[19]-RS[21]-MAPJOIN[46]
{code}
after the optimization
{code}
TS[0]-FIL[41]-SEL[2]-MAPJOIN[45]-GBY[10]-RS[11]-GBY[12]-MAPJOIN[47]-SEL[31]-FS[32]
-RS[25]-GBY[26]-RS[29]-MAPJOIN[47]
TS[3]-FIL[42]-SEL[5]-RS[7]-MAPJOIN[45]
{code}
so the tez explain
before
{code}
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Tez
DagId: root_20171027044107_4977290a-6856-41c5-b0e3-24b4ec32c59e:1
Edges:
Map 1 <- Map 3 (BROADCAST_EDGE)
Map 4 <- Map 6 (BROADCAST_EDGE)
Reducer 2 <- Map 1 (SIMPLE_EDGE), Reducer 5 (BROADCAST_EDGE)
Reducer 5 <- Map 4 (SIMPLE_EDGE)
DagName: root_20171027044107_4977290a-6856-41c5-b0e3-24b4ec32c59e:1
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: a
Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE
Column stats: NONE
Filter Operator
predicate: key is not null (type: boolean)
Statistics: Num rows: 10 Data size: 70 Basic stats:
COMPLETE Column stats: NONE
Select Operator
expressions: key (type: int)
outputColumnNames: _col0
Statistics: Num rows: 10 Data size: 70 Basic stats:
COMPLETE Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
keys:
0 _col0 (type: int)
1 _col0 (type: int)
outputColumnNames: _col0
input vertices:
1 Map 3
Statistics: Num rows: 11 Data size: 77 Basic stats:
COMPLETE Column stats: NONE
HybridGraceHashJoin: true
Group By Operator
aggregations: count()
keys: _col0 (type: int)
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 11 Data size: 77 Basic stats:
COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: int)
sort order: +
Map-reduce partition columns: _col0 (type: int)
Statistics: Num rows: 11 Data size: 77 Basic stats:
COMPLETE Column stats: NONE
value expressions: _col1 (type: bigint)
Map 3
Map Operator Tree:
TableScan
alias: b
Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE
Column stats: NONE
Filter Operator
predicate: key is not null (type: boolean)
Statistics: Num rows: 10 Data size: 70 Basic stats:
COMPLETE Column stats: NONE
Select Operator
expressions: key (type: int)
outputColumnNames: _col0
Statistics: Num rows: 10 Data size: 70 Basic stats:
COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: int)
sort order: +
Map-reduce partition columns: _col0 (type: int)
Statistics: Num rows: 10 Data size: 70 Basic stats:
COMPLETE Column stats: NONE
Map 4
Map Operator Tree:
TableScan
alias: a
Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE
Column stats: NONE
Filter Operator
predicate: key is not null (type: boolean)
Statistics: Num rows: 10 Data size: 70 Basic stats:
COMPLETE Column stats: NONE
Select Operator
expressions: key (type: int)
outputColumnNames: _col0
Statistics: Num rows: 10 Data size: 70 Basic stats:
COMPLETE Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
keys:
0 _col0 (type: int)
1 _col0 (type: int)
outputColumnNames: _col0
input vertices:
1 Map 6
Statistics: Num rows: 11 Data size: 77 Basic stats:
COMPLETE Column stats: NONE
HybridGraceHashJoin: true
Group By Operator
aggregations: count()
keys: _col0 (type: int)
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 11 Data size: 77 Basic stats:
COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: int)
sort order: +
Map-reduce partition columns: _col0 (type: int)
Statistics: Num rows: 11 Data size: 77 Basic stats:
COMPLETE Column stats: NONE
value expressions: _col1 (type: bigint)
Map 6
Map Operator Tree:
TableScan
alias: b
Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE
Column stats: NONE
Filter Operator
predicate: key is not null (type: boolean)
Statistics: Num rows: 10 Data size: 70 Basic stats:
COMPLETE Column stats: NONE
Select Operator
expressions: key (type: int)
outputColumnNames: _col0
Statistics: Num rows: 10 Data size: 70 Basic stats:
COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: int)
sort order: +
Map-reduce partition columns: _col0 (type: int)
Statistics: Num rows: 10 Data size: 70 Basic stats:
COMPLETE Column stats: NONE
Reducer 2
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
keys: KEY._col0 (type: int)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 5 Data size: 35 Basic stats: COMPLETE
Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
keys:
0 _col0 (type: int)
1 _col0 (type: int)
outputColumnNames: _col0, _col1, _col3
input vertices:
1 Reducer 5
Statistics: Num rows: 5 Data size: 38 Basic stats: COMPLETE
Column stats: NONE
HybridGraceHashJoin: true
Select Operator
expressions: _col0 (type: int), _col3 (type: bigint), _col1
(type: bigint)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 5 Data size: 38 Basic stats: COMPLETE
Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 5 Data size: 38 Basic stats:
COMPLETE Column stats: NONE
table:
input format:
org.apache.hadoop.mapred.SequenceFileInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Reducer 5
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
keys: KEY._col0 (type: int)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 5 Data size: 35 Basic stats: COMPLETE
Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: int)
sort order: +
Map-reduce partition columns: _col0 (type: int)
Statistics: Num rows: 5 Data size: 35 Basic stats: COMPLETE
Column stats: NONE
value expressions: _col1 (type: bigint)
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
{code}
After
{code}
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Tez
DagId: root_20171027043345_8893c266-bde2-4f23-9914-0dfccc8be5ea:1
Edges:
Map 1 <- Map 4 (BROADCAST_EDGE)
Reducer 2 <- Map 1 (SIMPLE_EDGE), Reducer 3 (BROADCAST_EDGE)
Reducer 3 <- Map 1 (SIMPLE_EDGE)
DagName: root_20171027043345_8893c266-bde2-4f23-9914-0dfccc8be5ea:1
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: a
Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE
Column stats: NONE
Filter Operator
predicate: key is not null (type: boolean)
Statistics: Num rows: 10 Data size: 70 Basic stats:
COMPLETE Column stats: NONE
Select Operator
expressions: key (type: int)
outputColumnNames: _col0
Statistics: Num rows: 10 Data size: 70 Basic stats:
COMPLETE Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
keys:
0 _col0 (type: int)
1 _col0 (type: int)
outputColumnNames: _col0
input vertices:
1 Map 4
Statistics: Num rows: 11 Data size: 77 Basic stats:
COMPLETE Column stats: NONE
HybridGraceHashJoin: true
Group By Operator
aggregations: count()
keys: _col0 (type: int)
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 11 Data size: 77 Basic stats:
COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: int)
sort order: +
Map-reduce partition columns: _col0 (type: int)
Statistics: Num rows: 11 Data size: 77 Basic stats:
COMPLETE Column stats: NONE
value expressions: _col1 (type: bigint)
Reduce Output Operator
key expressions: _col0 (type: int)
sort order: +
Map-reduce partition columns: _col0 (type: int)
Statistics: Num rows: 11 Data size: 77 Basic stats:
COMPLETE Column stats: NONE
value expressions: _col1 (type: bigint)
Map 4
Map Operator Tree:
TableScan
alias: b
Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE
Column stats: NONE
Filter Operator
predicate: key is not null (type: boolean)
Statistics: Num rows: 10 Data size: 70 Basic stats:
COMPLETE Column stats: NONE
Select Operator
expressions: key (type: int)
outputColumnNames: _col0
Statistics: Num rows: 10 Data size: 70 Basic stats:
COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: int)
sort order: +
Map-reduce partition columns: _col0 (type: int)
Statistics: Num rows: 10 Data size: 70 Basic stats:
COMPLETE Column stats: NONE
Reducer 2
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
keys: KEY._col0 (type: int)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 5 Data size: 35 Basic stats: COMPLETE
Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
keys:
0 _col0 (type: int)
1 _col0 (type: int)
outputColumnNames: _col0, _col1, _col3
input vertices:
1 Reducer 3
Statistics: Num rows: 5 Data size: 38 Basic stats: COMPLETE
Column stats: NONE
HybridGraceHashJoin: true
Select Operator
expressions: _col0 (type: int), _col3 (type: bigint), _col1
(type: bigint)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 5 Data size: 38 Basic stats: COMPLETE
Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 5 Data size: 38 Basic stats:
COMPLETE Column stats: NONE
table:
input format:
org.apache.hadoop.mapred.SequenceFileInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Reducer 3
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
keys: KEY._col0 (type: int)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 5 Data size: 35 Basic stats: COMPLETE
Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: int)
sort order: +
Map-reduce partition columns: _col0 (type: int)
Statistics: Num rows: 5 Data size: 35 Basic stats: COMPLETE
Column stats: NONE
value expressions: _col1 (type: bigint)
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
{code}
We can see that there are only 2 Maps(Map1 and Map4) in the after explain (in
before explain, there are 4 Maps).
When i tried to change the physical plan for HOS like what HOT does. I found
change of the optimization
before
{code}
STAGE DEPENDENCIES:
Stage-3 is a root stage
Stage-2 depends on stages: Stage-3
Stage-4 depends on stages: Stage-2
Stage-1 depends on stages: Stage-4
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-3
Spark
DagName: root_20171027045349_0dc8e597-ffad-4da7-bbfc-c2b7533d9b58:4
Vertices:
Map 6
Map Operator Tree:
TableScan
alias: b
Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE
Column stats: NONE
Filter Operator
predicate: key is not null (type: boolean)
Statistics: Num rows: 10 Data size: 70 Basic stats:
COMPLETE Column stats: NONE
Select Operator
expressions: key (type: int)
outputColumnNames: _col0
Statistics: Num rows: 10 Data size: 70 Basic stats:
COMPLETE Column stats: NONE
Spark HashTable Sink Operator
keys:
0 _col0 (type: int)
1 _col0 (type: int)
Local Work:
Map Reduce Local Work
Stage: Stage-2
Spark
Edges:
Reducer 5 <- Map 4 (GROUP, 1)
DagName: root_20171027045349_0dc8e597-ffad-4da7-bbfc-c2b7533d9b58:2
Vertices:
Map 4
Map Operator Tree:
TableScan
alias: a
Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE
Column stats: NONE
Filter Operator
predicate: key is not null (type: boolean)
Statistics: Num rows: 10 Data size: 70 Basic stats:
COMPLETE Column stats: NONE
Select Operator
expressions: key (type: int)
outputColumnNames: _col0
Statistics: Num rows: 10 Data size: 70 Basic stats:
COMPLETE Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
keys:
0 _col0 (type: int)
1 _col0 (type: int)
outputColumnNames: _col0
input vertices:
1 Map 6
Statistics: Num rows: 11 Data size: 77 Basic stats:
COMPLETE Column stats: NONE
Group By Operator
aggregations: count()
keys: _col0 (type: int)
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 11 Data size: 77 Basic stats:
COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: int)
sort order: +
Map-reduce partition columns: _col0 (type: int)
Statistics: Num rows: 11 Data size: 77 Basic stats:
COMPLETE Column stats: NONE
value expressions: _col1 (type: bigint)
Local Work:
Map Reduce Local Work
Reducer 5
Local Work:
Map Reduce Local Work
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
keys: KEY._col0 (type: int)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 5 Data size: 35 Basic stats: COMPLETE
Column stats: NONE
Spark HashTable Sink Operator
keys:
0 _col0 (type: int)
1 _col0 (type: int)
Stage: Stage-4
Spark
DagName: root_20171027045349_0dc8e597-ffad-4da7-bbfc-c2b7533d9b58:3
Vertices:
Map 3
Map Operator Tree:
TableScan
alias: b
Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE
Column stats: NONE
Filter Operator
predicate: key is not null (type: boolean)
Statistics: Num rows: 10 Data size: 70 Basic stats:
COMPLETE Column stats: NONE
Select Operator
expressions: key (type: int)
outputColumnNames: _col0
Statistics: Num rows: 10 Data size: 70 Basic stats:
COMPLETE Column stats: NONE
Spark HashTable Sink Operator
keys:
0 _col0 (type: int)
1 _col0 (type: int)
Local Work:
Map Reduce Local Work
Stage: Stage-1
Spark
Edges:
Reducer 2 <- Map 1 (GROUP, 1)
DagName: root_20171027045349_0dc8e597-ffad-4da7-bbfc-c2b7533d9b58:1
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: a
Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE
Column stats: NONE
Filter Operator
predicate: key is not null (type: boolean)
Statistics: Num rows: 10 Data size: 70 Basic stats:
COMPLETE Column stats: NONE
Select Operator
expressions: key (type: int)
outputColumnNames: _col0
Statistics: Num rows: 10 Data size: 70 Basic stats:
COMPLETE Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
keys:
0 _col0 (type: int)
1 _col0 (type: int)
outputColumnNames: _col0
input vertices:
1 Map 3
Statistics: Num rows: 11 Data size: 77 Basic stats:
COMPLETE Column stats: NONE
Group By Operator
aggregations: count()
keys: _col0 (type: int)
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 11 Data size: 77 Basic stats:
COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: int)
sort order: +
Map-reduce partition columns: _col0 (type: int)
Statistics: Num rows: 11 Data size: 77 Basic stats:
COMPLETE Column stats: NONE
value expressions: _col1 (type: bigint)
Local Work:
Map Reduce Local Work
Reducer 2
Local Work:
Map Reduce Local Work
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
keys: KEY._col0 (type: int)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 5 Data size: 35 Basic stats: COMPLETE
Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
keys:
0 _col0 (type: int)
1 _col0 (type: int)
outputColumnNames: _col0, _col1, _col3
input vertices:
1 Reducer 5
Statistics: Num rows: 5 Data size: 38 Basic stats: COMPLETE
Column stats: NONE
Select Operator
expressions: _col0 (type: int), _col3 (type: bigint), _col1
(type: bigint)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 5 Data size: 38 Basic stats: COMPLETE
Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 5 Data size: 38 Basic stats:
COMPLETE Column stats: NONE
table:
input format:
org.apache.hadoop.mapred.SequenceFileInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
{code}
After
{code}
STAGE DEPENDENCIES:
Stage-3 is a root stage
Stage-2 depends on stages: Stage-3
Stage-4 depends on stages: Stage-2
Stage-1 depends on stages: Stage-4
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-3
Spark
DagName: root_20171027045324_b7f21ae7-9bcc-4077-8577-e6c6747dfcff:3
Vertices:
Map 4
Map Operator Tree:
TableScan
alias: b
Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE
Column stats: NONE
Filter Operator
predicate: key is not null (type: boolean)
Statistics: Num rows: 10 Data size: 70 Basic stats:
COMPLETE Column stats: NONE
Select Operator
expressions: key (type: int)
outputColumnNames: _col0
Statistics: Num rows: 10 Data size: 70 Basic stats:
COMPLETE Column stats: NONE
Spark HashTable Sink Operator
keys:
0 _col0 (type: int)
1 _col0 (type: int)
Local Work:
Map Reduce Local Work
Stage: Stage-2
Spark
Edges:
Reducer 3 <- Map 6 (GROUP, 1)
DagName: root_20171027045324_b7f21ae7-9bcc-4077-8577-e6c6747dfcff:2
Vertices:
Map 6
Map Operator Tree:
TableScan
alias: a
Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE
Column stats: NONE
Filter Operator
predicate: key is not null (type: boolean)
Statistics: Num rows: 10 Data size: 70 Basic stats:
COMPLETE Column stats: NONE
Select Operator
expressions: key (type: int)
outputColumnNames: _col0
Statistics: Num rows: 10 Data size: 70 Basic stats:
COMPLETE Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
keys:
0 _col0 (type: int)
1 _col0 (type: int)
outputColumnNames: _col0
input vertices:
1 Map 4
Statistics: Num rows: 11 Data size: 77 Basic stats:
COMPLETE Column stats: NONE
Group By Operator
aggregations: count()
keys: _col0 (type: int)
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 11 Data size: 77 Basic stats:
COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: int)
sort order: +
Map-reduce partition columns: _col0 (type: int)
Statistics: Num rows: 11 Data size: 77 Basic stats:
COMPLETE Column stats: NONE
value expressions: _col1 (type: bigint)
Local Work:
Map Reduce Local Work
Reducer 3
Local Work:
Map Reduce Local Work
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
keys: KEY._col0 (type: int)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 5 Data size: 35 Basic stats: COMPLETE
Column stats: NONE
Spark HashTable Sink Operator
keys:
0 _col0 (type: int)
1 _col0 (type: int)
Stage: Stage-4
Spark
DagName: root_20171027045324_b7f21ae7-9bcc-4077-8577-e6c6747dfcff:4
Vertices:
Map 4
Map Operator Tree:
TableScan
alias: b
Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE
Column stats: NONE
Filter Operator
predicate: key is not null (type: boolean)
Statistics: Num rows: 10 Data size: 70 Basic stats:
COMPLETE Column stats: NONE
Select Operator
expressions: key (type: int)
outputColumnNames: _col0
Statistics: Num rows: 10 Data size: 70 Basic stats:
COMPLETE Column stats: NONE
Spark HashTable Sink Operator
keys:
0 _col0 (type: int)
1 _col0 (type: int)
Local Work:
Map Reduce Local Work
Stage: Stage-1
Spark
Edges:
Reducer 2 <- Map 5 (GROUP, 1)
DagName: root_20171027045324_b7f21ae7-9bcc-4077-8577-e6c6747dfcff:1
Vertices:
Map 5
Map Operator Tree:
TableScan
alias: a
Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE
Column stats: NONE
Filter Operator
predicate: key is not null (type: boolean)
Statistics: Num rows: 10 Data size: 70 Basic stats:
COMPLETE Column stats: NONE
Select Operator
expressions: key (type: int)
outputColumnNames: _col0
Statistics: Num rows: 10 Data size: 70 Basic stats:
COMPLETE Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
keys:
0 _col0 (type: int)
1 _col0 (type: int)
outputColumnNames: _col0
input vertices:
1 Map 4
Statistics: Num rows: 11 Data size: 77 Basic stats:
COMPLETE Column stats: NONE
Group By Operator
aggregations: count()
keys: _col0 (type: int)
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 11 Data size: 77 Basic stats:
COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: int)
sort order: +
Map-reduce partition columns: _col0 (type: int)
Statistics: Num rows: 11 Data size: 77 Basic stats:
COMPLETE Column stats: NONE
value expressions: _col1 (type: bigint)
Local Work:
Map Reduce Local Work
Reducer 2
Local Work:
Map Reduce Local Work
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
keys: KEY._col0 (type: int)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 5 Data size: 35 Basic stats: COMPLETE
Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
keys:
0 _col0 (type: int)
1 _col0 (type: int)
outputColumnNames: _col0, _col1, _col3
input vertices:
1 Reducer 3
Statistics: Num rows: 5 Data size: 38 Basic stats: COMPLETE
Column stats: NONE
Select Operator
expressions: _col0 (type: int), _col3 (type: bigint), _col1
(type: bigint)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 5 Data size: 38 Basic stats: COMPLETE
Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 5 Data size: 38 Basic stats:
COMPLETE Column stats: NONE
table:
input format:
org.apache.hadoop.mapred.SequenceFileInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
{code}
only the Map about table {{b}} are merged to 1 Map, there are still two Maps
about table {{a}}(Map1,Map4).
The reason causes this is because
[GenSparkWork|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java#L421]
will split following physical operator tree once encounting RS
{code}
TS[0]-FIL[41]-SEL[2]-MAPJOIN[45]-GBY[10]-RS[11]-GBY[12]-MAPJOIN[47]-SEL[31]-FS[32]
-RS[25]-GBY[26]-RS[29]-MAPJOIN[47]
TS[3]-FIL[42]-SEL[5]-RS[7]-MAPJOIN[45]
{code} to
{code}
Map1:TS[0]-FIL[41]-SEL[2]-MAPJOIN[45]-GBY[10]-RS[11]
Reduce2:GBY[12]-MAPJOIN[47]-SEL[31]-FS[32]
Map3:TS[0]-FIL[41]-SEL[2]-MAPJOIN[45]-GBY[10]-RS[25]
Reduce4:GBY[26]-RS[29]
Map5:TS[3]-FIL[42]-SEL[5]-RS[7]
{code}
> Enable SharedWorkOptimizer in tez on HOS
> ----------------------------------------
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
> Issue Type: Bug
> Reporter: liyunzhang
> Assignee: liyunzhang
>
> in HIVE-16602, Implement shared scans with Tez.
> Given a query plan, the goal is to identify scans on input tables that can be
> merged so the data is read only once. Optimization will be carried out at the
> physical level. In Hive on Spark, it caches the result of spark work if the
> spark work is used by more than 1 child spark work. After sharedWorkOptimizer
> is enabled in physical plan in HoS, the identical table scans are merged to 1
> table scan. This result of table scan will be used by more 1 child spark
> work. Thus we need not do the same computation because of cache mechanism.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)