[ 
https://issues.apache.org/jira/browse/HIVE-16602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16233668#comment-16233668
 ] 

liyunzhang commented on HIVE-16602:
-----------------------------------

[~jcamachorodriguez]:  I indeed found the performance improvement in a not very 
good hw env(less hard disk and less memory). I guess this is because shared 
scans optimization can help reduce duplicated table scan because table scan may 
take long time in a bad hw env. for 
example(DS/[query28.sql|https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/perf/query28.q]
before the shared scan optimization, the explain
{code}
Vertex dependency in root stage
Reducer 11 <- Map 10 (SIMPLE_EDGE)
Reducer 13 <- Map 12 (SIMPLE_EDGE)
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Reducer 3 <- Reducer 11 (CUSTOM_SIMPLE_EDGE), Reducer 13 (CUSTOM_SIMPLE_EDGE), 
Reducer 2 (CUSTOM_SIMPLE_EDGE), Reducer 5 (CUSTOM_SIMPLE_EDGE), Reducer 7 
(CUSTOM_SIMPLE_EDGE), Reducer 9 (CUSTOM_SIMPLE_EDGE)
Reducer 5 <- Map 4 (SIMPLE_EDGE)
Reducer 7 <- Map 6 (SIMPLE_EDGE)
Reducer 9 <- Map 8 (SIMPLE_EDGE)

Stage-0
  Fetch Operator
    limit:100
    Stage-1
      Reducer 3
      File Output Operator [FS_51]
        Limit [LIM_50] (rows=1 width=2497)
          Number of rows:100
          Select Operator [SEL_49] (rows=1 width=2497)
            
Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13","_col14","_col15","_col16","_col17"]
            Merge Join Operator [MERGEJOIN_58] (rows=1 width=2497)
              
Conds:(Inner),(Inner),(Inner),(Inner),(Inner),Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13","_col14","_col15","_col16","_col17"]
            <-Reducer 11 [CUSTOM_SIMPLE_EDGE]
              PARTITION_ONLY_SHUFFLE [RS_46]
                Group By Operator [GBY_33] (rows=1 width=416)
                  
Output:["_col0","_col1","_col2"],aggregations:["avg(VALUE._col0)","count(VALUE._col1)","count(DISTINCT
 KEY._col0:0._col0)"]
                <-Map 10 [SIMPLE_EDGE]
                  SHUFFLE [RS_32]
                    Group By Operator [GBY_31] (rows=21333171 width=88)
                      
Output:["_col0","_col1","_col2","_col3"],aggregations:["avg(ss_list_price)","count(ss_list_price)","count(DISTINCT
 ss_list_price)"],keys:ss_list_price
                      Select Operator [SEL_30] (rows=21333171 width=88)
                        Output:["ss_list_price"]
                        Filter Operator [FIL_56] (rows=21333171 width=88)
                          predicate:(ss_quantity BETWEEN 11 AND 15 and 
(ss_list_price BETWEEN 66 AND 76 or ss_coupon_amt BETWEEN 920 AND 1920 or 
ss_wholesale_cost BETWEEN 4 AND 24))
                          TableScan [TS_28] (rows=575995635 width=88)
                            
default@store_sales,store_sales,Tbl:COMPLETE,Col:NONE,Output:["ss_quantity","ss_wholesale_cost","ss_list_price","ss_coupon_amt"]
            <-Reducer 13 [CUSTOM_SIMPLE_EDGE]
              PARTITION_ONLY_SHUFFLE [RS_47]
                Group By Operator [GBY_40] (rows=1 width=416)
                  
Output:["_col0","_col1","_col2"],aggregations:["avg(VALUE._col0)","count(VALUE._col1)","count(DISTINCT
 KEY._col0:0._col0)"]
                <-Map 12 [SIMPLE_EDGE]
                  SHUFFLE [RS_39]
                    Group By Operator [GBY_38] (rows=21333171 width=88)
                      
Output:["_col0","_col1","_col2","_col3"],aggregations:["avg(ss_list_price)","count(ss_list_price)","count(DISTINCT
 ss_list_price)"],keys:ss_list_price
                      Select Operator [SEL_37] (rows=21333171 width=88)
                        Output:["ss_list_price"]
                        Filter Operator [FIL_57] (rows=21333171 width=88)
                          predicate:(ss_quantity BETWEEN 6 AND 10 and 
(ss_list_price BETWEEN 91 AND 101 or ss_coupon_amt BETWEEN 1430 AND 2430 or 
ss_wholesale_cost BETWEEN 32 AND 52))
                          TableScan [TS_35] (rows=575995635 width=88)
                            
default@store_sales,store_sales,Tbl:COMPLETE,Col:NONE,Output:["ss_quantity","ss_wholesale_cost","ss_list_price","ss_coupon_amt"]
            <-Reducer 2 [CUSTOM_SIMPLE_EDGE]
              PARTITION_ONLY_SHUFFLE [RS_42]
                Group By Operator [GBY_5] (rows=1 width=416)
                  
Output:["_col0","_col1","_col2"],aggregations:["avg(VALUE._col0)","count(VALUE._col1)","count(DISTINCT
 KEY._col0:0._col0)"]
                <-Map 1 [SIMPLE_EDGE]
                  SHUFFLE [RS_4]
                    Group By Operator [GBY_3] (rows=21333171 width=88)
                      
Output:["_col0","_col1","_col2","_col3"],aggregations:["avg(ss_list_price)","count(ss_list_price)","count(DISTINCT
 ss_list_price)"],keys:ss_list_price
                      Select Operator [SEL_2] (rows=21333171 width=88)
                        Output:["ss_list_price"]
                        Filter Operator [FIL_52] (rows=21333171 width=88)
                          predicate:(ss_quantity BETWEEN 0 AND 5 and 
(ss_list_price BETWEEN 11 AND 21 or ss_coupon_amt BETWEEN 460 AND 1460 or 
ss_wholesale_cost BETWEEN 14 AND 34))
                          TableScan [TS_0] (rows=575995635 width=88)
                            
default@store_sales,store_sales,Tbl:COMPLETE,Col:NONE,Output:["ss_quantity","ss_wholesale_cost","ss_list_price","ss_coupon_amt"]
            <-Reducer 5 [CUSTOM_SIMPLE_EDGE]
              PARTITION_ONLY_SHUFFLE [RS_43]
                Group By Operator [GBY_12] (rows=1 width=416)
                  
Output:["_col0","_col1","_col2"],aggregations:["avg(VALUE._col0)","count(VALUE._col1)","count(DISTINCT
 KEY._col0:0._col0)"]
                <-Map 4 [SIMPLE_EDGE]
                  SHUFFLE [RS_11]
                    Group By Operator [GBY_10] (rows=21333171 width=88)
                      
Output:["_col0","_col1","_col2","_col3"],aggregations:["avg(ss_list_price)","count(ss_list_price)","count(DISTINCT
 ss_list_price)"],keys:ss_list_price
                      Select Operator [SEL_9] (rows=21333171 width=88)
                        Output:["ss_list_price"]
                        Filter Operator [FIL_53] (rows=21333171 width=88)
                          predicate:(ss_quantity BETWEEN 26 AND 30 and 
(ss_list_price BETWEEN 28 AND 38 or ss_coupon_amt BETWEEN 2513 AND 3513 or 
ss_wholesale_cost BETWEEN 42 AND 62))
                          TableScan [TS_7] (rows=575995635 width=88)
                            
default@store_sales,store_sales,Tbl:COMPLETE,Col:NONE,Output:["ss_quantity","ss_wholesale_cost","ss_list_price","ss_coupon_amt"]
            <-Reducer 7 [CUSTOM_SIMPLE_EDGE]
              PARTITION_ONLY_SHUFFLE [RS_44]
                Group By Operator [GBY_19] (rows=1 width=416)
                  
Output:["_col0","_col1","_col2"],aggregations:["avg(VALUE._col0)","count(VALUE._col1)","count(DISTINCT
 KEY._col0:0._col0)"]
                <-Map 6 [SIMPLE_EDGE]
                  SHUFFLE [RS_18]
                    Group By Operator [GBY_17] (rows=21333171 width=88)
                      
Output:["_col0","_col1","_col2","_col3"],aggregations:["avg(ss_list_price)","count(ss_list_price)","count(DISTINCT
 ss_list_price)"],keys:ss_list_price
                      Select Operator [SEL_16] (rows=21333171 width=88)
                        Output:["ss_list_price"]
                        Filter Operator [FIL_54] (rows=21333171 width=88)
                          predicate:(ss_quantity BETWEEN 21 AND 25 and 
(ss_list_price BETWEEN 135 AND 145 or ss_coupon_amt BETWEEN 14180 AND 15180 or 
ss_wholesale_cost BETWEEN 38 AND 58))
                          TableScan [TS_14] (rows=575995635 width=88)
                            
default@store_sales,store_sales,Tbl:COMPLETE,Col:NONE,Output:["ss_quantity","ss_wholesale_cost","ss_list_price","ss_coupon_amt"]
            <-Reducer 9 [CUSTOM_SIMPLE_EDGE]
              PARTITION_ONLY_SHUFFLE [RS_45]
                Group By Operator [GBY_26] (rows=1 width=416)
                  
Output:["_col0","_col1","_col2"],aggregations:["avg(VALUE._col0)","count(VALUE._col1)","count(DISTINCT
 KEY._col0:0._col0)"]
                <-Map 8 [SIMPLE_EDGE]
                  SHUFFLE [RS_25]
                    Group By Operator [GBY_24] (rows=21333171 width=88)
                      
Output:["_col0","_col1","_col2","_col3"],aggregations:["avg(ss_list_price)","count(ss_list_price)","count(DISTINCT
 ss_list_price)"],keys:ss_list_price
                      Select Operator [SEL_23] (rows=21333171 width=88)
                        Output:["ss_list_price"]
                        Filter Operator [FIL_55] (rows=21333171 width=88)
                          predicate:(ss_quantity BETWEEN 16 AND 20 and 
(ss_list_price BETWEEN 142 AND 152 or ss_coupon_amt BETWEEN 3054 AND 4054 or 
ss_wholesale_cost BETWEEN 80 AND 100))
                          TableScan [TS_21] (rows=575995635 width=88)
                            
default@store_sales,store_sales,Tbl:COMPLETE,Col:NONE,Output:["ss_quantity","ss_wholesale_cost","ss_list_price","ss_coupon_amt"]
{code}

After the shared scan optimization, explain
{code}
Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE), Reducer 4 (CUSTOM_SIMPLE_EDGE), 
Reducer 5 (CUSTOM_SIMPLE_EDGE), Reducer 6 (CUSTOM_SIMPLE_EDGE), Reducer 7 
(CUSTOM_SIMPLE_EDGE), Reducer 8 (CUSTOM_SIMPLE_EDGE)
Reducer 4 <- Map 1 (SIMPLE_EDGE)
Reducer 5 <- Map 1 (SIMPLE_EDGE)
Reducer 6 <- Map 1 (SIMPLE_EDGE)
Reducer 7 <- Map 1 (SIMPLE_EDGE)
Reducer 8 <- Map 1 (SIMPLE_EDGE)

Stage-0
  Fetch Operator
    limit:100
    Stage-1
      Reducer 3
      File Output Operator [FS_51]
        Limit [LIM_50] (rows=1 width=2497)
          Number of rows:100
          Select Operator [SEL_49] (rows=1 width=2497)
            
Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13","_col14","_col15","_col16","_col17"]
            Merge Join Operator [MERGEJOIN_58] (rows=1 width=2497)
              
Conds:(Inner),(Inner),(Inner),(Inner),(Inner),Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13","_col14","_col15","_col16","_col17"]
            <-Reducer 2 [CUSTOM_SIMPLE_EDGE]
              PARTITION_ONLY_SHUFFLE [RS_42]
                Group By Operator [GBY_5] (rows=1 width=416)
                  
Output:["_col0","_col1","_col2"],aggregations:["avg(VALUE._col0)","count(VALUE._col1)","count(DISTINCT
 KEY._col0:0._col0)"]
                <-Map 1 [SIMPLE_EDGE]
                  SHUFFLE [RS_4]
                    Group By Operator [GBY_3] (rows=21333171 width=88)
                      
Output:["_col0","_col1","_col2","_col3"],aggregations:["avg(ss_list_price)","count(ss_list_price)","count(DISTINCT
 ss_list_price)"],keys:ss_list_price
                      Select Operator [SEL_2] (rows=21333171 width=88)
                        Output:["ss_list_price"]
                        Filter Operator [FIL_52] (rows=21333171 width=88)
                          predicate:(ss_quantity BETWEEN 0 AND 5 and 
(ss_list_price BETWEEN 11 AND 21 or ss_coupon_amt BETWEEN 460 AND 1460 or 
ss_wholesale_cost BETWEEN 14 AND 34))
                          TableScan [TS_0] (rows=575995635 width=88)
                            
default@store_sales,store_sales,Tbl:COMPLETE,Col:NONE,Output:["ss_quantity","ss_wholesale_cost","ss_list_price","ss_coupon_amt"]
            <-Reducer 4 [CUSTOM_SIMPLE_EDGE]
              PARTITION_ONLY_SHUFFLE [RS_43]
                Group By Operator [GBY_12] (rows=1 width=416)
                  
Output:["_col0","_col1","_col2"],aggregations:["avg(VALUE._col0)","count(VALUE._col1)","count(DISTINCT
 KEY._col0:0._col0)"]
                <-Map 1 [SIMPLE_EDGE]
                  SHUFFLE [RS_11]
                    Group By Operator [GBY_10] (rows=21333171 width=88)
                      
Output:["_col0","_col1","_col2","_col3"],aggregations:["avg(ss_list_price)","count(ss_list_price)","count(DISTINCT
 ss_list_price)"],keys:ss_list_price
                      Select Operator [SEL_9] (rows=21333171 width=88)
                        Output:["ss_list_price"]
                        Filter Operator [FIL_53] (rows=21333171 width=88)
                          predicate:(ss_quantity BETWEEN 26 AND 30 and 
(ss_list_price BETWEEN 28 AND 38 or ss_coupon_amt BETWEEN 2513 AND 3513 or 
ss_wholesale_cost BETWEEN 42 AND 62))
                           Please refer to the previous TableScan [TS_0]
            <-Reducer 5 [CUSTOM_SIMPLE_EDGE]
              PARTITION_ONLY_SHUFFLE [RS_44]
                Group By Operator [GBY_19] (rows=1 width=416)
                  
Output:["_col0","_col1","_col2"],aggregations:["avg(VALUE._col0)","count(VALUE._col1)","count(DISTINCT
 KEY._col0:0._col0)"]
                <-Map 1 [SIMPLE_EDGE]
                  SHUFFLE [RS_18]
                    Group By Operator [GBY_17] (rows=21333171 width=88)
                      
Output:["_col0","_col1","_col2","_col3"],aggregations:["avg(ss_list_price)","count(ss_list_price)","count(DISTINCT
 ss_list_price)"],keys:ss_list_price
                      Select Operator [SEL_16] (rows=21333171 width=88)
                        Output:["ss_list_price"]
                        Filter Operator [FIL_54] (rows=21333171 width=88)
                          predicate:(ss_quantity BETWEEN 21 AND 25 and 
(ss_list_price BETWEEN 135 AND 145 or ss_coupon_amt BETWEEN 14180 AND 15180 or 
ss_wholesale_cost BETWEEN 38 AND 58))
                           Please refer to the previous TableScan [TS_0]
            <-Reducer 6 [CUSTOM_SIMPLE_EDGE]
              PARTITION_ONLY_SHUFFLE [RS_45]
                Group By Operator [GBY_26] (rows=1 width=416)
                  
Output:["_col0","_col1","_col2"],aggregations:["avg(VALUE._col0)","count(VALUE._col1)","count(DISTINCT
 KEY._col0:0._col0)"]
                <-Map 1 [SIMPLE_EDGE]
                  SHUFFLE [RS_25]
                    Group By Operator [GBY_24] (rows=21333171 width=88)
                      
Output:["_col0","_col1","_col2","_col3"],aggregations:["avg(ss_list_price)","count(ss_list_price)","count(DISTINCT
 ss_list_price)"],keys:ss_list_price
                      Select Operator [SEL_23] (rows=21333171 width=88)
                        Output:["ss_list_price"]
                        Filter Operator [FIL_55] (rows=21333171 width=88)
                          predicate:(ss_quantity BETWEEN 16 AND 20 and 
(ss_list_price BETWEEN 142 AND 152 or ss_coupon_amt BETWEEN 3054 AND 4054 or 
ss_wholesale_cost BETWEEN 80 AND 100))
                           Please refer to the previous TableScan [TS_0]
            <-Reducer 7 [CUSTOM_SIMPLE_EDGE]
              PARTITION_ONLY_SHUFFLE [RS_46]
                Group By Operator [GBY_33] (rows=1 width=416)
                  
Output:["_col0","_col1","_col2"],aggregations:["avg(VALUE._col0)","count(VALUE._col1)","count(DISTINCT
 KEY._col0:0._col0)"]
                <-Map 1 [SIMPLE_EDGE]
                  SHUFFLE [RS_32]
                    Group By Operator [GBY_31] (rows=21333171 width=88)
                      
Output:["_col0","_col1","_col2","_col3"],aggregations:["avg(ss_list_price)","count(ss_list_price)","count(DISTINCT
 ss_list_price)"],keys:ss_list_price
                      Select Operator [SEL_30] (rows=21333171 width=88)
                        Output:["ss_list_price"]
                        Filter Operator [FIL_56] (rows=21333171 width=88)
                          predicate:(ss_quantity BETWEEN 11 AND 15 and 
(ss_list_price BETWEEN 66 AND 76 or ss_coupon_amt BETWEEN 920 AND 1920 or 
ss_wholesale_cost BETWEEN 4 AND 24))
                           Please refer to the previous TableScan [TS_0]
            <-Reducer 8 [CUSTOM_SIMPLE_EDGE]
              PARTITION_ONLY_SHUFFLE [RS_47]
                Group By Operator [GBY_40] (rows=1 width=416)
                  
Output:["_col0","_col1","_col2"],aggregations:["avg(VALUE._col0)","count(VALUE._col1)","count(DISTINCT
 KEY._col0:0._col0)"]
                <-Map 1 [SIMPLE_EDGE]
                  SHUFFLE [RS_39]
                    Group By Operator [GBY_38] (rows=21333171 width=88)
                      
Output:["_col0","_col1","_col2","_col3"],aggregations:["avg(ss_list_price)","count(ss_list_price)","count(DISTINCT
 ss_list_price)"],keys:ss_list_price
                      Select Operator [SEL_37] (rows=21333171 width=88)
                        Output:["ss_list_price"]
                        Filter Operator [FIL_57] (rows=21333171 width=88)
                          predicate:(ss_quantity BETWEEN 6 AND 10 and 
(ss_list_price BETWEEN 91 AND 101 or ss_coupon_amt BETWEEN 1430 AND 2430 or 
ss_wholesale_cost BETWEEN 32 AND 52))
                           Please refer to the previous TableScan [TS_0]

{code}

Before there are 6 Maps, After there are only 1 Map. In After explain, there 
are  6 edges from Map1 to Reducer 2, Reducer 4, Reducer 5, Reducer 6, Reducer 7 
and Reducer 8. Here i want to ask question whether  these edges can be shared 
so the table scan can be reused or not?

> Implement shared scans with Tez
> -------------------------------
>
>                 Key: HIVE-16602
>                 URL: https://issues.apache.org/jira/browse/HIVE-16602
>             Project: Hive
>          Issue Type: New Feature
>          Components: Physical Optimizer
>    Affects Versions: 3.0.0
>            Reporter: Jesus Camacho Rodriguez
>            Assignee: Jesus Camacho Rodriguez
>            Priority: Major
>             Fix For: 3.0.0
>
>         Attachments: HIVE-16602.01.patch, HIVE-16602.02.patch, 
> HIVE-16602.03.patch, HIVE-16602.04.patch, HIVE-16602.patch
>
>
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. Optimization will be carried out at the 
> physical level.
> In the longer term, identification of equivalent expressions and 
> reutilization of intermediary results should be done at the logical layer via 
> Spool operator.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to