[ https://issues.apache.org/jira/browse/HIVE-16668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun updated HIVE-16668:
----------------------------
    Description: 
To reproduce:
{code}
create table t1 (a string);
create table t2 (a array<string>);
create table dummy (a string);
insert into table dummy values ("a");
insert into t1 values ("1"), ("2");
insert into t2 select array("1", "2", "3", "4") from dummy;
set hive.auto.convert.join.noconditionaltask.size=3;

explain
with tt1 as (
  select a as id, count(*) over () as count
  from t1
),
tt2 as (
  select id
  from t2
  lateral view outer explode(a) a_tbl as id
)
select tt1.count
from tt1 join tt2 on tt1.id = tt2.id;
{code}

For Hive on Spark, the plan is:
{code}
STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-1 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-2
    Spark
      Edges:
        Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 3), Map 1 (PARTITION-LEVEL SORT, 3)
      DagName: chao_20170515133259_de9e0583-da24-4399-afc8-b881dfef0469:9
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: t1
                  Statistics: Num rows: 2 Data size: 2 Basic stats: COMPLETE Column stats: NONE
                  Reduce Output Operator
                    key expressions: 0 (type: int)
                    sort order: +
                    Map-reduce partition columns: 0 (type: int)
                    Statistics: Num rows: 2 Data size: 2 Basic stats: COMPLETE Column stats: NONE
                    value expressions: a (type: string)
        Reducer 2
            Local Work:
              Map Reduce Local Work
            Reduce Operator Tree:
              Select Operator
                expressions: VALUE._col0 (type: string)
                outputColumnNames: _col0
                Statistics: Num rows: 2 Data size: 2 Basic stats: COMPLETE Column stats: NONE
                PTF Operator
                  Function definitions:
                      Input definition
                        input alias: ptf_0
                        output shape: _col0: string
                        type: WINDOWING
                      Windowing table definition
                        input alias: ptf_1
                        name: windowingtablefunction
                        order by: 0 ASC NULLS FIRST
                        partition by: 0
                        raw input shape:
                        window functions:
                            window function definition
                              alias: count_window_0
                              name: count
                              window function: GenericUDAFCountEvaluator
                              window frame: PRECEDING(MAX)~FOLLOWING(MAX)
                              isStar: true
                  Statistics: Num rows: 2 Data size: 2 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: _col0 is not null (type: boolean)
                    Statistics: Num rows: 2 Data size: 2 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: _col0 (type: string), count_window_0 (type: bigint)
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 2 Data size: 2 Basic stats: COMPLETE Column stats: NONE
                      Spark HashTable Sink Operator
                        keys:
                          0 _col0 (type: string)
                          1 _col0 (type: string)
                      Reduce Output Operator
                        key expressions: _col0 (type: string)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: string)
                        Statistics: Num rows: 2 Data size: 2 Basic stats: COMPLETE Column stats: NONE
                        value expressions: _col1 (type: bigint)

  Stage: Stage-1
    Spark
      DagName: chao_20170515133259_de9e0583-da24-4399-afc8-b881dfef0469:8
      Vertices:
        Map 3
            Map Operator Tree:
                TableScan
                  alias: t2
                  Statistics: Num rows: 1 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                  Lateral View Forward
                    Statistics: Num rows: 1 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      Statistics: Num rows: 1 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                      Lateral View Join Operator
                        outputColumnNames: _col4
                        Statistics: Num rows: 2 Data size: 40 Basic stats: COMPLETE Column stats: NONE
                        Select Operator
                          expressions: _col4 (type: string)
                          outputColumnNames: _col0
                          Statistics: Num rows: 2 Data size: 40 Basic stats: COMPLETE Column stats: NONE
                          Map Join Operator
                            condition map:
                                 Inner Join 0 to 1
                            keys:
                              0 _col0 (type: string)
                              1 _col0 (type: string)
                            outputColumnNames: _col1
                            input vertices:
                              0 Reducer 2
                            Statistics: Num rows: 2 Data size: 2 Basic stats: COMPLETE Column stats: NONE
                            Select Operator
                              expressions: _col1 (type: bigint)
                              outputColumnNames: _col0
                              Statistics: Num rows: 2 Data size: 2 Basic stats: COMPLETE Column stats: NONE
                              File Output Operator
                                compressed: false
                                Statistics: Num rows: 2 Data size: 2 Basic stats: COMPLETE Column stats: NONE
                                table:
                                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    Select Operator
                      expressions: a (type: array<string>)
                      outputColumnNames: _col0
                      Statistics: Num rows: 1 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                      UDTF Operator
                        Statistics: Num rows: 1 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                        function name: explode
                        outer lateral view: true
                        Filter Operator
                          predicate: col is not null (type: boolean)
                          Statistics: Num rows: 1 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                          Lateral View Join Operator
                            outputColumnNames: _col4
                            Statistics: Num rows: 2 Data size: 40 Basic stats: COMPLETE Column stats: NONE
                            Select Operator
                              expressions: _col4 (type: string)
                              outputColumnNames: _col0
                              Statistics: Num rows: 2 Data size: 40 Basic stats: COMPLETE Column stats: NONE
                              Map Join Operator
                                condition map:
                                     Inner Join 0 to 1
                                keys:
                                  0 _col0 (type: string)
                                  1 _col0 (type: string)
                                outputColumnNames: _col1
                                input vertices:
                                  0 Reducer 2
                                Statistics: Num rows: 2 Data size: 2 Basic stats: COMPLETE Column stats: NONE
                                Select Operator
                                  expressions: _col1 (type: bigint)
                                  outputColumnNames: _col0
                                  Statistics: Num rows: 2 Data size: 2 Basic stats: COMPLETE Column stats: NONE
                                  File Output Operator
                                    compressed: false
                                    Statistics: Num rows: 2 Data size: 2 Basic stats: COMPLETE Column stats: NONE
                                    table:
                                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            Local Work:
              Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
{code}

Note that {{Reducer 2}} has two {{Map 1}} inputs in the plan above.
On Hive on Spark this query returns:
{code}
4
4
4
4
{code}
which is incorrect: {{t1}} has two rows, so {{count(*) over ()}} is 2, and the join matches ids {{"1"}} and {{"2"}}, so the expected output is two rows of {{2}}.
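To make the expected result concrete, here is a minimal Python simulation of the query's semantics (hypothetical illustration, not Hive code): {{tt1}} attaches the unpartitioned window count to each row of {{t1}}, {{tt2}} explodes the array in {{t2}}, and the join projects {{tt1.count}}.
{code}
# Simulate the repro query on the inserted data to derive the expected result.
t1 = ["1", "2"]                     # t1.a
t2 = [["1", "2", "3", "4"]]         # t2.a (array<string>)

# tt1: select a as id, count(*) over () as count from t1
# count(*) over () with no partition/order counts all rows of t1.
tt1 = [(a, len(t1)) for a in t1]    # [("1", 2), ("2", 2)]

# tt2: lateral view outer explode(a) -> one row per array element
tt2 = [elem for arr in t2 for elem in arr]   # ["1", "2", "3", "4"]

# select tt1.count from tt1 join tt2 on tt1.id = tt2.id
result = [cnt for (tid, cnt) in tt1 for eid in tt2 if tid == eid]
print(result)   # [2, 2] -- two rows, each with count 2
{code}
So the correct output is two rows of {{2}}, not four rows of {{4}}.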



> Hive on Spark generates incorrect plan and result with window function and 
> lateral view
> ---------------------------------------------------------------------------------------
>
>                 Key: HIVE-16668
>                 URL: https://issues.apache.org/jira/browse/HIVE-16668
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>            Reporter: Chao Sun
>            Assignee: Chao Sun
>         Attachments: HIVE-16668.1.patch
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
