[
https://issues.apache.org/jira/browse/HIVE-16602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16233668#comment-16233668
]
liyunzhang commented on HIVE-16602:
-----------------------------------
[~jcamachorodriguez]: I indeed found the performance improvement in a not very
good hw env(less hard disk and less memory). I guess this is because shared
scans optimization can help reduce duplicated table scan because table scan may
take long time in a bad hw env. for
example(DS/[query28.sql|https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/perf/query28.q]
before the shared scan optimization, the explain
{code}
Vertex dependency in root stage
Reducer 11 <- Map 10 (SIMPLE_EDGE)
Reducer 13 <- Map 12 (SIMPLE_EDGE)
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Reducer 3 <- Reducer 11 (CUSTOM_SIMPLE_EDGE), Reducer 13 (CUSTOM_SIMPLE_EDGE),
Reducer 2 (CUSTOM_SIMPLE_EDGE), Reducer 5 (CUSTOM_SIMPLE_EDGE), Reducer 7
(CUSTOM_SIMPLE_EDGE), Reducer 9 (CUSTOM_SIMPLE_EDGE)
Reducer 5 <- Map 4 (SIMPLE_EDGE)
Reducer 7 <- Map 6 (SIMPLE_EDGE)
Reducer 9 <- Map 8 (SIMPLE_EDGE)
Stage-0
Fetch Operator
limit:100
Stage-1
Reducer 3
File Output Operator [FS_51]
Limit [LIM_50] (rows=1 width=2497)
Number of rows:100
Select Operator [SEL_49] (rows=1 width=2497)
Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13","_col14","_col15","_col16","_col17"]
Merge Join Operator [MERGEJOIN_58] (rows=1 width=2497)
Conds:(Inner),(Inner),(Inner),(Inner),(Inner),Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13","_col14","_col15","_col16","_col17"]
<-Reducer 11 [CUSTOM_SIMPLE_EDGE]
PARTITION_ONLY_SHUFFLE [RS_46]
Group By Operator [GBY_33] (rows=1 width=416)
Output:["_col0","_col1","_col2"],aggregations:["avg(VALUE._col0)","count(VALUE._col1)","count(DISTINCT
KEY._col0:0._col0)"]
<-Map 10 [SIMPLE_EDGE]
SHUFFLE [RS_32]
Group By Operator [GBY_31] (rows=21333171 width=88)
Output:["_col0","_col1","_col2","_col3"],aggregations:["avg(ss_list_price)","count(ss_list_price)","count(DISTINCT
ss_list_price)"],keys:ss_list_price
Select Operator [SEL_30] (rows=21333171 width=88)
Output:["ss_list_price"]
Filter Operator [FIL_56] (rows=21333171 width=88)
predicate:(ss_quantity BETWEEN 11 AND 15 and
(ss_list_price BETWEEN 66 AND 76 or ss_coupon_amt BETWEEN 920 AND 1920 or
ss_wholesale_cost BETWEEN 4 AND 24))
TableScan [TS_28] (rows=575995635 width=88)
default@store_sales,store_sales,Tbl:COMPLETE,Col:NONE,Output:["ss_quantity","ss_wholesale_cost","ss_list_price","ss_coupon_amt"]
<-Reducer 13 [CUSTOM_SIMPLE_EDGE]
PARTITION_ONLY_SHUFFLE [RS_47]
Group By Operator [GBY_40] (rows=1 width=416)
Output:["_col0","_col1","_col2"],aggregations:["avg(VALUE._col0)","count(VALUE._col1)","count(DISTINCT
KEY._col0:0._col0)"]
<-Map 12 [SIMPLE_EDGE]
SHUFFLE [RS_39]
Group By Operator [GBY_38] (rows=21333171 width=88)
Output:["_col0","_col1","_col2","_col3"],aggregations:["avg(ss_list_price)","count(ss_list_price)","count(DISTINCT
ss_list_price)"],keys:ss_list_price
Select Operator [SEL_37] (rows=21333171 width=88)
Output:["ss_list_price"]
Filter Operator [FIL_57] (rows=21333171 width=88)
predicate:(ss_quantity BETWEEN 6 AND 10 and
(ss_list_price BETWEEN 91 AND 101 or ss_coupon_amt BETWEEN 1430 AND 2430 or
ss_wholesale_cost BETWEEN 32 AND 52))
TableScan [TS_35] (rows=575995635 width=88)
default@store_sales,store_sales,Tbl:COMPLETE,Col:NONE,Output:["ss_quantity","ss_wholesale_cost","ss_list_price","ss_coupon_amt"]
<-Reducer 2 [CUSTOM_SIMPLE_EDGE]
PARTITION_ONLY_SHUFFLE [RS_42]
Group By Operator [GBY_5] (rows=1 width=416)
Output:["_col0","_col1","_col2"],aggregations:["avg(VALUE._col0)","count(VALUE._col1)","count(DISTINCT
KEY._col0:0._col0)"]
<-Map 1 [SIMPLE_EDGE]
SHUFFLE [RS_4]
Group By Operator [GBY_3] (rows=21333171 width=88)
Output:["_col0","_col1","_col2","_col3"],aggregations:["avg(ss_list_price)","count(ss_list_price)","count(DISTINCT
ss_list_price)"],keys:ss_list_price
Select Operator [SEL_2] (rows=21333171 width=88)
Output:["ss_list_price"]
Filter Operator [FIL_52] (rows=21333171 width=88)
predicate:(ss_quantity BETWEEN 0 AND 5 and
(ss_list_price BETWEEN 11 AND 21 or ss_coupon_amt BETWEEN 460 AND 1460 or
ss_wholesale_cost BETWEEN 14 AND 34))
TableScan [TS_0] (rows=575995635 width=88)
default@store_sales,store_sales,Tbl:COMPLETE,Col:NONE,Output:["ss_quantity","ss_wholesale_cost","ss_list_price","ss_coupon_amt"]
<-Reducer 5 [CUSTOM_SIMPLE_EDGE]
PARTITION_ONLY_SHUFFLE [RS_43]
Group By Operator [GBY_12] (rows=1 width=416)
Output:["_col0","_col1","_col2"],aggregations:["avg(VALUE._col0)","count(VALUE._col1)","count(DISTINCT
KEY._col0:0._col0)"]
<-Map 4 [SIMPLE_EDGE]
SHUFFLE [RS_11]
Group By Operator [GBY_10] (rows=21333171 width=88)
Output:["_col0","_col1","_col2","_col3"],aggregations:["avg(ss_list_price)","count(ss_list_price)","count(DISTINCT
ss_list_price)"],keys:ss_list_price
Select Operator [SEL_9] (rows=21333171 width=88)
Output:["ss_list_price"]
Filter Operator [FIL_53] (rows=21333171 width=88)
predicate:(ss_quantity BETWEEN 26 AND 30 and
(ss_list_price BETWEEN 28 AND 38 or ss_coupon_amt BETWEEN 2513 AND 3513 or
ss_wholesale_cost BETWEEN 42 AND 62))
TableScan [TS_7] (rows=575995635 width=88)
default@store_sales,store_sales,Tbl:COMPLETE,Col:NONE,Output:["ss_quantity","ss_wholesale_cost","ss_list_price","ss_coupon_amt"]
<-Reducer 7 [CUSTOM_SIMPLE_EDGE]
PARTITION_ONLY_SHUFFLE [RS_44]
Group By Operator [GBY_19] (rows=1 width=416)
Output:["_col0","_col1","_col2"],aggregations:["avg(VALUE._col0)","count(VALUE._col1)","count(DISTINCT
KEY._col0:0._col0)"]
<-Map 6 [SIMPLE_EDGE]
SHUFFLE [RS_18]
Group By Operator [GBY_17] (rows=21333171 width=88)
Output:["_col0","_col1","_col2","_col3"],aggregations:["avg(ss_list_price)","count(ss_list_price)","count(DISTINCT
ss_list_price)"],keys:ss_list_price
Select Operator [SEL_16] (rows=21333171 width=88)
Output:["ss_list_price"]
Filter Operator [FIL_54] (rows=21333171 width=88)
predicate:(ss_quantity BETWEEN 21 AND 25 and
(ss_list_price BETWEEN 135 AND 145 or ss_coupon_amt BETWEEN 14180 AND 15180 or
ss_wholesale_cost BETWEEN 38 AND 58))
TableScan [TS_14] (rows=575995635 width=88)
default@store_sales,store_sales,Tbl:COMPLETE,Col:NONE,Output:["ss_quantity","ss_wholesale_cost","ss_list_price","ss_coupon_amt"]
<-Reducer 9 [CUSTOM_SIMPLE_EDGE]
PARTITION_ONLY_SHUFFLE [RS_45]
Group By Operator [GBY_26] (rows=1 width=416)
Output:["_col0","_col1","_col2"],aggregations:["avg(VALUE._col0)","count(VALUE._col1)","count(DISTINCT
KEY._col0:0._col0)"]
<-Map 8 [SIMPLE_EDGE]
SHUFFLE [RS_25]
Group By Operator [GBY_24] (rows=21333171 width=88)
Output:["_col0","_col1","_col2","_col3"],aggregations:["avg(ss_list_price)","count(ss_list_price)","count(DISTINCT
ss_list_price)"],keys:ss_list_price
Select Operator [SEL_23] (rows=21333171 width=88)
Output:["ss_list_price"]
Filter Operator [FIL_55] (rows=21333171 width=88)
predicate:(ss_quantity BETWEEN 16 AND 20 and
(ss_list_price BETWEEN 142 AND 152 or ss_coupon_amt BETWEEN 3054 AND 4054 or
ss_wholesale_cost BETWEEN 80 AND 100))
TableScan [TS_21] (rows=575995635 width=88)
default@store_sales,store_sales,Tbl:COMPLETE,Col:NONE,Output:["ss_quantity","ss_wholesale_cost","ss_list_price","ss_coupon_amt"]
{code}
After the shared scan optimization, explain
{code}
Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE), Reducer 4 (CUSTOM_SIMPLE_EDGE),
Reducer 5 (CUSTOM_SIMPLE_EDGE), Reducer 6 (CUSTOM_SIMPLE_EDGE), Reducer 7
(CUSTOM_SIMPLE_EDGE), Reducer 8 (CUSTOM_SIMPLE_EDGE)
Reducer 4 <- Map 1 (SIMPLE_EDGE)
Reducer 5 <- Map 1 (SIMPLE_EDGE)
Reducer 6 <- Map 1 (SIMPLE_EDGE)
Reducer 7 <- Map 1 (SIMPLE_EDGE)
Reducer 8 <- Map 1 (SIMPLE_EDGE)
Stage-0
Fetch Operator
limit:100
Stage-1
Reducer 3
File Output Operator [FS_51]
Limit [LIM_50] (rows=1 width=2497)
Number of rows:100
Select Operator [SEL_49] (rows=1 width=2497)
Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13","_col14","_col15","_col16","_col17"]
Merge Join Operator [MERGEJOIN_58] (rows=1 width=2497)
Conds:(Inner),(Inner),(Inner),(Inner),(Inner),Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13","_col14","_col15","_col16","_col17"]
<-Reducer 2 [CUSTOM_SIMPLE_EDGE]
PARTITION_ONLY_SHUFFLE [RS_42]
Group By Operator [GBY_5] (rows=1 width=416)
Output:["_col0","_col1","_col2"],aggregations:["avg(VALUE._col0)","count(VALUE._col1)","count(DISTINCT
KEY._col0:0._col0)"]
<-Map 1 [SIMPLE_EDGE]
SHUFFLE [RS_4]
Group By Operator [GBY_3] (rows=21333171 width=88)
Output:["_col0","_col1","_col2","_col3"],aggregations:["avg(ss_list_price)","count(ss_list_price)","count(DISTINCT
ss_list_price)"],keys:ss_list_price
Select Operator [SEL_2] (rows=21333171 width=88)
Output:["ss_list_price"]
Filter Operator [FIL_52] (rows=21333171 width=88)
predicate:(ss_quantity BETWEEN 0 AND 5 and
(ss_list_price BETWEEN 11 AND 21 or ss_coupon_amt BETWEEN 460 AND 1460 or
ss_wholesale_cost BETWEEN 14 AND 34))
TableScan [TS_0] (rows=575995635 width=88)
default@store_sales,store_sales,Tbl:COMPLETE,Col:NONE,Output:["ss_quantity","ss_wholesale_cost","ss_list_price","ss_coupon_amt"]
<-Reducer 4 [CUSTOM_SIMPLE_EDGE]
PARTITION_ONLY_SHUFFLE [RS_43]
Group By Operator [GBY_12] (rows=1 width=416)
Output:["_col0","_col1","_col2"],aggregations:["avg(VALUE._col0)","count(VALUE._col1)","count(DISTINCT
KEY._col0:0._col0)"]
<-Map 1 [SIMPLE_EDGE]
SHUFFLE [RS_11]
Group By Operator [GBY_10] (rows=21333171 width=88)
Output:["_col0","_col1","_col2","_col3"],aggregations:["avg(ss_list_price)","count(ss_list_price)","count(DISTINCT
ss_list_price)"],keys:ss_list_price
Select Operator [SEL_9] (rows=21333171 width=88)
Output:["ss_list_price"]
Filter Operator [FIL_53] (rows=21333171 width=88)
predicate:(ss_quantity BETWEEN 26 AND 30 and
(ss_list_price BETWEEN 28 AND 38 or ss_coupon_amt BETWEEN 2513 AND 3513 or
ss_wholesale_cost BETWEEN 42 AND 62))
Please refer to the previous TableScan [TS_0]
<-Reducer 5 [CUSTOM_SIMPLE_EDGE]
PARTITION_ONLY_SHUFFLE [RS_44]
Group By Operator [GBY_19] (rows=1 width=416)
Output:["_col0","_col1","_col2"],aggregations:["avg(VALUE._col0)","count(VALUE._col1)","count(DISTINCT
KEY._col0:0._col0)"]
<-Map 1 [SIMPLE_EDGE]
SHUFFLE [RS_18]
Group By Operator [GBY_17] (rows=21333171 width=88)
Output:["_col0","_col1","_col2","_col3"],aggregations:["avg(ss_list_price)","count(ss_list_price)","count(DISTINCT
ss_list_price)"],keys:ss_list_price
Select Operator [SEL_16] (rows=21333171 width=88)
Output:["ss_list_price"]
Filter Operator [FIL_54] (rows=21333171 width=88)
predicate:(ss_quantity BETWEEN 21 AND 25 and
(ss_list_price BETWEEN 135 AND 145 or ss_coupon_amt BETWEEN 14180 AND 15180 or
ss_wholesale_cost BETWEEN 38 AND 58))
Please refer to the previous TableScan [TS_0]
<-Reducer 6 [CUSTOM_SIMPLE_EDGE]
PARTITION_ONLY_SHUFFLE [RS_45]
Group By Operator [GBY_26] (rows=1 width=416)
Output:["_col0","_col1","_col2"],aggregations:["avg(VALUE._col0)","count(VALUE._col1)","count(DISTINCT
KEY._col0:0._col0)"]
<-Map 1 [SIMPLE_EDGE]
SHUFFLE [RS_25]
Group By Operator [GBY_24] (rows=21333171 width=88)
Output:["_col0","_col1","_col2","_col3"],aggregations:["avg(ss_list_price)","count(ss_list_price)","count(DISTINCT
ss_list_price)"],keys:ss_list_price
Select Operator [SEL_23] (rows=21333171 width=88)
Output:["ss_list_price"]
Filter Operator [FIL_55] (rows=21333171 width=88)
predicate:(ss_quantity BETWEEN 16 AND 20 and
(ss_list_price BETWEEN 142 AND 152 or ss_coupon_amt BETWEEN 3054 AND 4054 or
ss_wholesale_cost BETWEEN 80 AND 100))
Please refer to the previous TableScan [TS_0]
<-Reducer 7 [CUSTOM_SIMPLE_EDGE]
PARTITION_ONLY_SHUFFLE [RS_46]
Group By Operator [GBY_33] (rows=1 width=416)
Output:["_col0","_col1","_col2"],aggregations:["avg(VALUE._col0)","count(VALUE._col1)","count(DISTINCT
KEY._col0:0._col0)"]
<-Map 1 [SIMPLE_EDGE]
SHUFFLE [RS_32]
Group By Operator [GBY_31] (rows=21333171 width=88)
Output:["_col0","_col1","_col2","_col3"],aggregations:["avg(ss_list_price)","count(ss_list_price)","count(DISTINCT
ss_list_price)"],keys:ss_list_price
Select Operator [SEL_30] (rows=21333171 width=88)
Output:["ss_list_price"]
Filter Operator [FIL_56] (rows=21333171 width=88)
predicate:(ss_quantity BETWEEN 11 AND 15 and
(ss_list_price BETWEEN 66 AND 76 or ss_coupon_amt BETWEEN 920 AND 1920 or
ss_wholesale_cost BETWEEN 4 AND 24))
Please refer to the previous TableScan [TS_0]
<-Reducer 8 [CUSTOM_SIMPLE_EDGE]
PARTITION_ONLY_SHUFFLE [RS_47]
Group By Operator [GBY_40] (rows=1 width=416)
Output:["_col0","_col1","_col2"],aggregations:["avg(VALUE._col0)","count(VALUE._col1)","count(DISTINCT
KEY._col0:0._col0)"]
<-Map 1 [SIMPLE_EDGE]
SHUFFLE [RS_39]
Group By Operator [GBY_38] (rows=21333171 width=88)
Output:["_col0","_col1","_col2","_col3"],aggregations:["avg(ss_list_price)","count(ss_list_price)","count(DISTINCT
ss_list_price)"],keys:ss_list_price
Select Operator [SEL_37] (rows=21333171 width=88)
Output:["ss_list_price"]
Filter Operator [FIL_57] (rows=21333171 width=88)
predicate:(ss_quantity BETWEEN 6 AND 10 and
(ss_list_price BETWEEN 91 AND 101 or ss_coupon_amt BETWEEN 1430 AND 2430 or
ss_wholesale_cost BETWEEN 32 AND 52))
Please refer to the previous TableScan [TS_0]
{code}
Before there are 6 Maps, After there are only 1 Map. In After explain, there
are 6 edges from Map1 to Reducer 2, Reducer 4, Reducer 5, Reducer 6, Reducer 7
and Reducer 8. Here i want to ask question whether these edges can be shared
so the table scan can be reused or not?
> Implement shared scans with Tez
> -------------------------------
>
> Key: HIVE-16602
> URL: https://issues.apache.org/jira/browse/HIVE-16602
> Project: Hive
> Issue Type: New Feature
> Components: Physical Optimizer
> Affects Versions: 3.0.0
> Reporter: Jesus Camacho Rodriguez
> Assignee: Jesus Camacho Rodriguez
> Priority: Major
> Fix For: 3.0.0
>
> Attachments: HIVE-16602.01.patch, HIVE-16602.02.patch,
> HIVE-16602.03.patch, HIVE-16602.04.patch, HIVE-16602.patch
>
>
> Given a query plan, the goal is to identify scans on input tables that can be
> merged so the data is read only once. Optimization will be carried out at the
> physical level.
> In the longer term, identification of equivalent expressions and
> reutilization of intermediary results should be done at the logical layer via
> Spool operator.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)