[jira] [Updated] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

liyunzhang_intel (JIRA) Wed, 17 May 2017 22:08:22 -0700

     [ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


liyunzhang_intel updated HIVE-16600:
------------------------------------
    Attachment: HIVE-16600.6.patch

[~lirui]:
help review HIVE-16600.5.patch
bq.Please add more tests. E.g. add limit to the sub-query, add order by to 
inserts, etc.
 I add following test, is this what you mean "add order by to inserts"?
{code}
FROM (select key,value from src order by key) a
INSERT OVERWRITE TABLE e1
    SELECT key, value
INSERT OVERWRITE TABLE e2
    SELECT key order by key limit 10;
{code}

If yes, it is very interesting that the explain is following, it seems that the 
part of e1 is deleted in this case. When you execute the query, you will find 
there is no result in e1 in this case.
{code}
STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-1 depends on stages: Stage-2
  Stage-3 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-2
    Spark
      Edges:
        Reducer 2 <- Map 1 (SORT, 1)
#### A masked pattern was here #### 
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: src
                  Statistics: Num rows: 500 Data size: 5312 Basic stats: 
COMPLETE Column stats: NONE 
                  Select Operator
                    expressions: key (type: string), value (type: string)
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 500 Data size: 5312 Basic stats: 
COMPLETE Column stats: NONE 
                    Reduce Output Operator
                      key expressions: _col0 (type: string)
                      sort order: +
                      Statistics: Num rows: 500 Data size: 5312 Basic stats: 
COMPLETE Column stats: NONE 
                      TopN Hash Memory Usage: 0.1
        Reducer 2 
            Reduce Operator Tree:
              Select Operator
                expressions: KEY.reducesinkkey0 (type: string)
                outputColumnNames: _col0
                Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE 
Column stats: NONE
                Limit
                  Number of rows: 10
                  Statistics: Num rows: 10 Data size: 100 Basic stats: COMPLETE 
Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 10 Data size: 100 Basic stats: 
COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        serde: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                        name: default.e2

  Stage: Stage-1
    Move Operator
      tables:
          replace: true
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: default.e2

  Stage: Stage-3
    Stats-Aggr Operator

PREHOOK: query: FROM (select key,value from src order by key) a
INSERT OVERWRITE TABLE e1
    SELECT key, value
INSERT OVERWRITE TABLE e2
                    compressed: false
                    Statistics: Num rows: 10 Data size: 100 Basic stats: 
COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        serde: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                        name: default.e2

  Stage: Stage-1
    Move Operator
      tables:
          replace: true
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: default.e2

  Stage: Stage-3
    Stats-Aggr Operator
{code}


> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-16600
>                 URL: https://issues.apache.org/jira/browse/HIVE-16600
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>         Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, HIVE-16600.4.patch, HIVE-16600.5.patch, 
> HIVE-16600.6.patch, mr.explain, mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
>     SELECT key, value
> INSERT OVERWRITE TABLE e2
>     SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>    TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
>                             -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Updated] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

Reply via email to