[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

liyunzhang_intel (JIRA) Thu, 11 May 2017 01:20:23 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16006069#comment-16006069
 ]


liyunzhang_intel commented on HIVE-16600:
-----------------------------------------

[~lirui]: I did not paste the whole MR plan here; the full plan is in the 
attachment mr.explain.log.HIVE-16600.  In the reduce operator tree of Stage-2 
below, there are two child Select Operators, each with its own File Output 
Operator, which correspond to the multi insert.  This shows that there is no 
extra stage in MR mode.
{code}
      Reduce Operator Tree:
        Select Operator
          expressions: KEY.reducesinkkey0 (type: string), VALUE._col1 (type: string)
          outputColumnNames: _col0, _col2
          Select Operator
            expressions: _col0 (type: string), _col2 (type: string)
            outputColumnNames: _col0, _col1
            File Output Operator
              compressed: false
              GlobalTableId: 1
              directory: hdfs://bdpe41:8020/user/hive/warehouse/e1/.hive-staging_hive_2017-05-11_15-03-16_656_7584605594727686973-1/-ext-10000
              NumFilesPerFileSink: 1
              Stats Publishing Key Prefix: hdfs://bdpe41:8020/user/hive/warehouse/e1/.hive-staging_hive_2017-05-11_15-03-16_656_7584605594727686973-1/-ext-10000/
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  properties:
                    COLUMN_STATS_ACCURATE {"BASIC_STATS":"true"}
                    bucket_count -1
                    column.name.delimiter ,
                    columns key,value
                    columns.comments 
                    columns.types string:string
                    file.inputformat org.apache.hadoop.mapred.TextInputFormat
                    file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    location hdfs://bdpe41:8020/user/hive/warehouse/e1
                    name default.e1
                    numFiles 0
                    numRows 0
                    rawDataSize 0
                    serialization.ddl struct e1 { string key, string value}
                    serialization.format 1
                    serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    totalSize 0
                    transient_lastDdlTime 1494486196
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  name: default.e1
              TotalFiles: 1
              GatherStats: true
              MultiFileSpray: false
          Select Operator
            expressions: _col0 (type: string)
            outputColumnNames: _col0
            Limit
              Number of rows: 10
              File Output Operator
                compressed: false
                GlobalTableId: 0
                directory: hdfs://bdpe41:8020/tmp/hive/root/7f5afded-5f75-46f9-a588-e7317a8decca/hive_2017-05-11_15-03-16_656_7584605594727686973-1/-mr-10004
                NumFilesPerFileSink: 1
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                    properties:
                      column.name.delimiter ,
                      columns _col0
                      columns.types string
                      escape.delim \
                      serialization.lib org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
                    serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
                TotalFiles: 1
                GatherStats: false
                MultiFileSpray: false

{code}


> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-16600
>                 URL: https://issues.apache.org/jira/browse/HIVE-16600
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>         Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
>     SELECT key, value
> INSERT OVERWRITE TABLE e2
>     SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> The parallelism of Sort is 1 even when parallel order by is enabled 
> ("hive.optimize.sampling.orderby" is set to "true").  This is not 
> reasonable because the parallelism should be calculated by 
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170].
> This happens because SetSparkReducerParallelism#needSetParallelism returns 
> false when the [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> In this case, the children size of {{RS[2]}} is two.
> The logical plan of the case:
> {code}
>    TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
>                             -SEL[6]-FS[7]
> {code}
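
For readers who want to see why a branch in the plan blocks the parallelism setting, here is a minimal sketch of the check described in the issue. It is not the Hive source: the Op class, the ParallelismSketch class, and the two plan builders are hypothetical names made up for illustration, and the real SetSparkReducerParallelism#needSetParallelism has further conditions that are left out. The sketch assumes the check walks down from the RS and gives up as soon as an operator on the way has more than one child, which matches the plan above, where the split into SEL[4] and SEL[6] sits below SEL[3].

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Hive operator; NOT the real Operator class.
final class Op {
    final String name;
    final List<Op> children = new ArrayList<>();

    Op(String name) {
        this.name = name;
    }

    // Attaches a child and returns this operator so calls can be chained.
    Op add(Op child) {
        children.add(child);
        return this;
    }
}

// Simplified model of the check described in the issue (an assumption,
// not the actual SetSparkReducerParallelism code).
final class ParallelismSketch {

    // Returns false as soon as the chain below the ReduceSink branches,
    // i.e. some operator on the way down has more than one child.
    static boolean needSetParallelism(Op reduceSink) {
        List<Op> children = reduceSink.children;
        while (!children.isEmpty()) {
            if (children.size() != 1) {
                return false; // branch found: parallelism is left untouched
            }
            children = children.get(0).children;
        }
        return true; // linear chain: parallelism may be recomputed
    }

    // The multi insert plan from the issue:
    //   RS[2]-SEL[3]-SEL[4]-FS[5]
    //               -SEL[6]-FS[7]
    static Op multiInsertPlan() {
        Op sel3 = new Op("SEL[3]")
            .add(new Op("SEL[4]").add(new Op("FS[5]")))
            .add(new Op("SEL[6]").add(new Op("FS[7]")));
        return new Op("RS[2]").add(sel3);
    }

    // A hypothetical single insert plan for comparison: RS[2]-SEL[3]-FS[4]
    static Op singleInsertPlan() {
        return new Op("RS[2]").add(new Op("SEL[3]").add(new Op("FS[4]")));
    }
}
```

Under these assumptions, ParallelismSketch.needSetParallelism(ParallelismSketch.multiInsertPlan()) returns false, so the Sort would keep a parallelism of 1, while the single insert plan returns true.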



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
