[jira] [Commented] (HIVE-10841) [WHERE col is not null] does not work sometimes for queries with many JOIN statements

Alexander Pivovarov (JIRA) Wed, 03 Jun 2015 16:54:07 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-10841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14571849#comment-14571849
 ]


Alexander Pivovarov commented on HIVE-10841:
--------------------------------------------

LOG info for the queries with different JOIN operators order

L, LA, FR, A, PI, acct - only 2 log messages from ppd.OpProcFactory contain "= 
120"
{code}
explain
select
  acct.ACC_N,
  acct.brn
FROM L
JOIN LA ON L.id = LA.loan_id
JOIN FR ON L.id = FR.loan_id
JOIN A ON LA.aid = A.id
JOIN PI ON PI.id = LA.pi_id
JOIN acct ON A.id = acct.aid
WHERE
  L.id = 4436
  and acct.brn = 120;

15/06/03 16:31:47 [main]: INFO ppd.OpProcFactory: Processing for FIL(25)
15/06/03 16:31:47 [main]: INFO ppd.OpProcFactory: Pushdown Predicates of FIL 
For Alias : acct
15/06/03 16:31:47 [main]: INFO ppd.OpProcFactory:       (_col20 = 120)

15/06/03 16:31:47 [main]: INFO ppd.OpProcFactory: Processing for JOIN(24)
15/06/03 16:31:47 [main]: INFO ppd.OpProcFactory: Pushdown Predicates of JOIN 
For Alias : acct
15/06/03 16:31:47 [main]: INFO ppd.OpProcFactory:       (VALUE._col19 = 120)

  Stage: Stage-9
    Map Reduce
                    Select Operator
                      expressions: _col19 (type: int), 120 (type: int)
{code}

L, LA, FR, A, acct, PI - 8 log lines from  ppd.OpProcFactory contain "= 120"

{code}
explain
select
  acct.ACC_N,
  acct.brn
FROM L
JOIN LA ON L.id = LA.loan_id
JOIN FR ON L.id = FR.loan_id
JOIN A ON LA.aid = A.id
JOIN acct ON A.id = acct.aid
JOIN PI ON PI.id = LA.pi_id
WHERE
  L.id = 4436
  and acct.brn = 120;

15/06/03 15:45:25 [main]: INFO ppd.OpProcFactory: Processing for FIL(25)
15/06/03 15:45:39 [main]: INFO ppd.OpProcFactory: Pushdown Predicates of FIL 
For Alias : acct
15/06/03 15:45:39 [main]: INFO ppd.OpProcFactory:       (_col20 = 120)

15/06/03 15:46:23 [main]: INFO ppd.OpProcFactory: Processing for JOIN(24)
15/06/03 15:46:23 [main]: INFO ppd.OpProcFactory: Pushdown Predicates of JOIN 
For Alias : acct
15/06/03 15:46:23 [main]: INFO ppd.OpProcFactory:       (VALUE._col19 = 120)

15/06/03 15:46:26 [main]: INFO ppd.OpProcFactory: Processing for RS(21)
15/06/03 15:46:26 [main]: INFO ppd.OpProcFactory: Pushdown Predicates of RS For 
Alias : acct
15/06/03 15:46:26 [main]: INFO ppd.OpProcFactory:       (_col20 = 120)

15/06/03 15:46:43 [main]: INFO ppd.OpProcFactory: Processing for FIL(20)
15/06/03 15:46:49 [main]: INFO ppd.OpProcFactory: Pushdown Predicates of FIL 
For Alias : acct
15/06/03 15:46:49 [main]: INFO ppd.OpProcFactory:       (_col20 = 120)

15/06/03 15:46:52 [main]: INFO ppd.OpProcFactory: Processing for JOIN(19)
15/06/03 15:46:52 [main]: INFO ppd.OpProcFactory: Pushdown Predicates of JOIN 
For Alias : acct
15/06/03 15:46:52 [main]: INFO ppd.OpProcFactory:       (VALUE._col1 = 120)

15/06/03 15:59:18 [main]: INFO ppd.OpProcFactory: Processing for RS(18)
15/06/03 15:59:18 [main]: INFO ppd.OpProcFactory: Pushdown Predicates of RS For 
Alias : acct
15/06/03 15:59:18 [main]: INFO ppd.OpProcFactory:       (brn = 120)

15/06/03 15:59:19 [main]: INFO ppd.OpProcFactory: Processing for FIL(17)
15/06/03 15:59:50 [main]: INFO ppd.OpProcFactory: Pushdown Predicates of FIL 
For Alias : acct
15/06/03 15:59:50 [main]: INFO ppd.OpProcFactory:       (brn = 120)

15/06/03 16:00:20 [main]: INFO ppd.OpProcFactory: Processing for TS(4)
15/06/03 16:00:20 [main]: INFO ppd.OpProcFactory: Pushdown Predicates of TS For 
Alias : acct
15/06/03 16:00:20 [main]: INFO ppd.OpProcFactory:       aid is not null
15/06/03 16:00:20 [main]: INFO ppd.OpProcFactory:       (brn = 120)

15/06/03 16:01:38 [main]: INFO optimizer.ConstantPropagateProcFactory: expr 
Const int 120 fold from Column[VALUE._col19] is removed.

15/06/03 16:01:38 [main]: INFO optimizer.ColumnPrunerProcFactory: RS 21 
oldColExprMap: {VALUE._col5=Column[_col5], VALUE._col4=Const int 4436, 
VALUE._col3=Column[_col3], VALUE._col2=Column[_col2], 
VALUE._col1=Column[_col1], VALUE._col0=Const int 4436, 
KEY.reducesinkkey0=Column[_col6], VALUE._col14=Column[_col15], 
VALUE._col13=Column[_col14], VALUE._col16=Column[_col17], 
VALUE._col15=Column[_col16], VALUE._col18=Column[_col19], VALUE._col9=Const int 
4436, VALUE._col17=Column[_col18], VALUE._col8=Column[_col9], 
VALUE._col7=Column[_col8], VALUE._col19=Const int 120, 
VALUE._col6=Column[_col7], VALUE._col20=Column[_col21], 
VALUE._col11=Column[_col12], VALUE._col21=Column[_col22], 
VALUE._col12=Column[_col13], VALUE._col22=Column[_col23], 
VALUE._col10=Column[_col11]}

15/06/03 16:01:38 [main]: INFO optimizer.ColumnPrunerProcFactory: RS 18 
oldColExprMap: {VALUE._col4=Column[ROW__ID], 
VALUE._col3=Column[INPUT__FILE__NAME], 
VALUE._col2=Column[BLOCK__OFFSET__INSIDE__FILE], VALUE._col1=Const int 120, 
VALUE._col0=Column[acc_n], KEY.reducesinkkey0=Column[aid]}

STAGE PLANS:
  Stage: Stage-12
        acct 
          TableScan
            alias: acct
            Statistics: Num rows: 5 Data size: 63 Basic stats: COMPLETE Column 
stats: NONE
            Filter Operator
              predicate: (aid is not null and (brn = 120)) (type: boolean


  Stage: Stage-9
    Map Reduce
                    Select Operator
                      expressions: _col19 (type: int), 120 (type: int)
{code}

> [WHERE col is not null] does not work sometimes for queries with many JOIN 
> statements
> -------------------------------------------------------------------------------------
>
>                 Key: HIVE-10841
>                 URL: https://issues.apache.org/jira/browse/HIVE-10841
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Planning, Query Processor
>    Affects Versions: 0.13.0, 0.14.0, 0.13.1, 1.2.0
>            Reporter: Alexander Pivovarov
>            Assignee: Alexander Pivovarov
>         Attachments: HIVE-10841.patch
>
>
> The result from the following SELECT query is 3 rows but it should be 1 row.
> I checked it in MySQL - it returned 1 row.
> To reproduce the issue in Hive
> 1. prepare tables
> {code}
> drop table if exists L;
> drop table if exists LA;
> drop table if exists FR;
> drop table if exists A;
> drop table if exists PI;
> drop table if exists acct;
> create table L as select 4436 id;
> create table LA as select 4436 loan_id, 4748 aid, 4415 pi_id;
> create table FR as select 4436 loan_id;
> create table A as select 4748 id;
> create table PI as select 4415 id;
> create table acct as select 4748 aid, 10 acc_n, 122 brn;
> insert into table acct values(4748, null, null);
> insert into table acct values(4748, null, null);
> {code}
> 2. run SELECT query
> {code}
> select
>   acct.ACC_N,
>   acct.brn
> FROM L
> JOIN LA ON L.id = LA.loan_id
> JOIN FR ON L.id = FR.loan_id
> JOIN A ON LA.aid = A.id
> JOIN PI ON PI.id = LA.pi_id
> JOIN acct ON A.id = acct.aid
> WHERE
>   L.id = 4436
>   and acct.brn is not null;
> {code}
> the result is 3 rows
> {code}
> 10    122
> NULL  NULL
> NULL  NULL
> {code}
> but it should be 1 row
> {code}
> 10    122
> {code}
> 2.1 "explain select ..." output for hive-1.3.0 MR
> {code}
> STAGE DEPENDENCIES:
>   Stage-12 is a root stage
>   Stage-9 depends on stages: Stage-12
>   Stage-0 depends on stages: Stage-9
> STAGE PLANS:
>   Stage: Stage-12
>     Map Reduce Local Work
>       Alias -> Map Local Tables:
>         a 
>           Fetch Operator
>             limit: -1
>         acct 
>           Fetch Operator
>             limit: -1
>         fr 
>           Fetch Operator
>             limit: -1
>         l 
>           Fetch Operator
>             limit: -1
>         pi 
>           Fetch Operator
>             limit: -1
>       Alias -> Map Local Operator Tree:
>         a 
>           TableScan
>             alias: a
>             Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column 
> stats: NONE
>             Filter Operator
>               predicate: id is not null (type: boolean)
>               Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE 
> Column stats: NONE
>               HashTable Sink Operator
>                 keys:
>                   0 _col5 (type: int)
>                   1 id (type: int)
>                   2 aid (type: int)
>         acct 
>           TableScan
>             alias: acct
>             Statistics: Num rows: 3 Data size: 31 Basic stats: COMPLETE 
> Column stats: NONE
>             Filter Operator
>               predicate: aid is not null (type: boolean)
>               Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE 
> Column stats: NONE
>               HashTable Sink Operator
>                 keys:
>                   0 _col5 (type: int)
>                   1 id (type: int)
>                   2 aid (type: int)
>         fr 
>           TableScan
>             alias: fr
>             Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column 
> stats: NONE
>             Filter Operator
>               predicate: (loan_id = 4436) (type: boolean)
>               Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE 
> Column stats: NONE
>               HashTable Sink Operator
>                 keys:
>                   0 4436 (type: int)
>                   1 4436 (type: int)
>                   2 4436 (type: int)
>         l 
>           TableScan
>             alias: l
>             Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column 
> stats: NONE
>             Filter Operator
>               predicate: (id = 4436) (type: boolean)
>               Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE 
> Column stats: NONE
>               HashTable Sink Operator
>                 keys:
>                   0 4436 (type: int)
>                   1 4436 (type: int)
>                   2 4436 (type: int)
>         pi 
>           TableScan
>             alias: pi
>             Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column 
> stats: NONE
>             Filter Operator
>               predicate: id is not null (type: boolean)
>               Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE 
> Column stats: NONE
>               HashTable Sink Operator
>                 keys:
>                   0 _col6 (type: int)
>                   1 id (type: int)
>   Stage: Stage-9
>     Map Reduce
>       Map Operator Tree:
>           TableScan
>             alias: la
>             Statistics: Num rows: 1 Data size: 14 Basic stats: COMPLETE 
> Column stats: NONE
>             Filter Operator
>               predicate: (((loan_id is not null and aid is not null) and 
> pi_id is not null) and (loan_id = 4436)) (type: boolean)
>               Statistics: Num rows: 1 Data size: 14 Basic stats: COMPLETE 
> Column stats: NONE
>               Map Join Operator
>                 condition map:
>                      Inner Join 0 to 1
>                      Inner Join 0 to 2
>                 keys:
>                   0 4436 (type: int)
>                   1 4436 (type: int)
>                   2 4436 (type: int)
>                 outputColumnNames: _col5, _col6
>                 Statistics: Num rows: 2 Data size: 8 Basic stats: COMPLETE 
> Column stats: NONE
>                 Map Join Operator
>                   condition map:
>                        Inner Join 0 to 1
>                        Inner Join 1 to 2
>                   keys:
>                     0 _col5 (type: int)
>                     1 id (type: int)
>                     2 aid (type: int)
>                   outputColumnNames: _col6, _col19, _col20
>                   Statistics: Num rows: 4 Data size: 17 Basic stats: COMPLETE 
> Column stats: NONE
>                   Map Join Operator
>                     condition map:
>                          Inner Join 0 to 1
>                     keys:
>                       0 _col6 (type: int)
>                       1 id (type: int)
>                     outputColumnNames: _col19, _col20
>                     Statistics: Num rows: 4 Data size: 18 Basic stats: 
> COMPLETE Column stats: NONE
>                     Select Operator
>                       expressions: _col19 (type: int), _col20 (type: int)
>                       outputColumnNames: _col0, _col1
>                       Statistics: Num rows: 4 Data size: 18 Basic stats: 
> COMPLETE Column stats: NONE
>                       File Output Operator
>                         compressed: false
>                         Statistics: Num rows: 4 Data size: 18 Basic stats: 
> COMPLETE Column stats: NONE
>                         table:
>                             input format: 
> org.apache.hadoop.mapred.TextInputFormat
>                             output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                             serde: 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>       Local Work:
>         Map Reduce Local Work
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
>       Processor Tree:
>         ListSink
> Time taken: 0.57 seconds, Fetched: 142 row(s)
> {code}
> 2.2. "explain select..." output for hive-0.13.1 Tez
> {code}
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> STAGE PLANS:
>   Stage: Stage-1
>     Tez
>       Edges:
>         Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 4 (SIMPLE_EDGE), Reducer 6 
> (SIMPLE_EDGE)
>         Reducer 3 <- Reducer 2 (SIMPLE_EDGE), Map 9 (SIMPLE_EDGE)
>         Reducer 6 <- Map 5 (SIMPLE_EDGE), Map 7 (SIMPLE_EDGE), Map 8 
> (SIMPLE_EDGE)
>       DagName: lcapp_20150528111717_06c57a5b-8dc6-4ce9-bce7-b9e0a7818fe4:1
>       Vertices:
>         Map 1 
>             Map Operator Tree:
>                 TableScan
>                   alias: acct
>                   Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE 
> Column stats: NONE
>                   Reduce Output Operator
>                     key expressions: aid (type: int)
>                     sort order: +
>                     Map-reduce partition columns: aid (type: int)
>                     Statistics: Num rows: 1 Data size: 4 Basic stats: 
> COMPLETE Column stats: NONE
>                     value expressions: acc_n (type: int), brn (type: int)
>         Map 4 
>             Map Operator Tree:
>                 TableScan
>                   alias: a
>                   Statistics: Num rows: 46 Data size: 187 Basic stats: 
> COMPLETE Column stats: NONE
>                   Reduce Output Operator
>                     key expressions: id (type: int)
>                     sort order: +
>                     Map-reduce partition columns: id (type: int)
>                     Statistics: Num rows: 46 Data size: 187 Basic stats: 
> COMPLETE Column stats: NONE
>         Map 5 
>             Map Operator Tree:
>                 TableScan
>                   alias: la
>                   Statistics: Num rows: 28 Data size: 347 Basic stats: 
> COMPLETE Column stats: NONE
>                   Filter Operator
>                     predicate: (loan_id = 4436) (type: boolean)
>                     Statistics: Num rows: 14 Data size: 173 Basic stats: 
> COMPLETE Column stats: NONE
>                     Reduce Output Operator
>                       key expressions: loan_id (type: int)
>                       sort order: +
>                       Map-reduce partition columns: loan_id (type: int)
>                       Statistics: Num rows: 14 Data size: 173 Basic stats: 
> COMPLETE Column stats: NONE
>                       value expressions: aid (type: int), pi_id (type: int)
>         Map 7 
>             Map Operator Tree:
>                 TableScan
>                   alias: fr
>                   Statistics: Num rows: 46 Data size: 187 Basic stats: 
> COMPLETE Column stats: NONE
>                   Filter Operator
>                     predicate: (loan_id = 4436) (type: boolean)
>                     Statistics: Num rows: 23 Data size: 93 Basic stats: 
> COMPLETE Column stats: NONE
>                     Reduce Output Operator
>                       key expressions: loan_id (type: int)
>                       sort order: +
>                       Map-reduce partition columns: loan_id (type: int)
>                       Statistics: Num rows: 23 Data size: 93 Basic stats: 
> COMPLETE Column stats: NONE
>         Map 8 
>             Map Operator Tree:
>                 TableScan
>                   alias: l
>                   Statistics: Num rows: 46 Data size: 187 Basic stats: 
> COMPLETE Column stats: NONE
>                   Filter Operator
>                     predicate: (id = 4436) (type: boolean)
>                     Statistics: Num rows: 23 Data size: 93 Basic stats: 
> COMPLETE Column stats: NONE
>                     Reduce Output Operator
>                       key expressions: id (type: int)
>                       sort order: +
>                       Map-reduce partition columns: id (type: int)
>                       Statistics: Num rows: 23 Data size: 93 Basic stats: 
> COMPLETE Column stats: NONE
>         Map 9 
>             Map Operator Tree:
>                 TableScan
>                   alias: pi
>                   Statistics: Num rows: 46 Data size: 187 Basic stats: 
> COMPLETE Column stats: NONE
>                   Reduce Output Operator
>                     key expressions: id (type: int)
>                     sort order: +
>                     Map-reduce partition columns: id (type: int)
>                     Statistics: Num rows: 46 Data size: 187 Basic stats: 
> COMPLETE Column stats: NONE
>         Reducer 2 
>             Reduce Operator Tree:
>               Join Operator
>                 condition map:
>                      Inner Join 0 to 1
>                      Inner Join 1 to 2
>                 condition expressions:
>                   0 {VALUE._col2}
>                   1 
>                   2 {VALUE._col1} {VALUE._col2}
>                 outputColumnNames: _col2, _col15, _col16
>                 Statistics: Num rows: 110 Data size: 448 Basic stats: 
> COMPLETE Column stats: NONE
>                 Reduce Output Operator
>                   key expressions: _col2 (type: int)
>                   sort order: +
>                   Map-reduce partition columns: _col2 (type: int)
>                   Statistics: Num rows: 110 Data size: 448 Basic stats: 
> COMPLETE Column stats: NONE
>                   value expressions: _col15 (type: int), _col16 (type: int)
>         Reducer 3 
>             Reduce Operator Tree:
>               Join Operator
>                 condition map:
>                      Inner Join 0 to 1
>                 condition expressions:
>                   0 {VALUE._col1} {VALUE._col2}
>                   1 
>                 outputColumnNames: _col1, _col2
>                 Statistics: Num rows: 121 Data size: 492 Basic stats: 
> COMPLETE Column stats: NONE
>                 Select Operator
>                   expressions: _col1 (type: int), _col2 (type: int)
>                   outputColumnNames: _col0, _col1
>                   Statistics: Num rows: 121 Data size: 492 Basic stats: 
> COMPLETE Column stats: NONE
>                   File Output Operator
>                     compressed: false
>                     Statistics: Num rows: 121 Data size: 492 Basic stats: 
> COMPLETE Column stats: NONE
>                     table:
>                         input format: org.apache.hadoop.mapred.TextInputFormat
>                         output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                         serde: 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>         Reducer 6 
>             Reduce Operator Tree:
>               Join Operator
>                 condition map:
>                      Inner Join 0 to 1
>                      Inner Join 0 to 2
>                 condition expressions:
>                   0 
>                   1 {VALUE._col1} {VALUE._col2}
>                   2 
>                 outputColumnNames: _col4, _col5
>                 Statistics: Num rows: 50 Data size: 204 Basic stats: COMPLETE 
> Column stats: NONE
>                 Reduce Output Operator
>                   key expressions: _col4 (type: int)
>                   sort order: +
>                   Map-reduce partition columns: _col4 (type: int)
>                   Statistics: Num rows: 50 Data size: 204 Basic stats: 
> COMPLETE Column stats: NONE
>                   value expressions: _col5 (type: int)
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
> Time taken: 1.377 seconds, Fetched: 146 row(s)
> {code}
> 3. The workaround is to put "acct.brn is not null" to join condition
> {code}
> select
>   acct.ACC_N,
>   acct.brn
> FROM L
> JOIN LA ON L.id = LA.loan_id
> JOIN FR ON L.id = FR.loan_id
> JOIN A ON LA.aid = A.id
> JOIN PI ON PI.id = LA.pi_id
> JOIN acct ON A.id = acct.aid and acct.brn is not null
> WHERE
>   L.id = 4436;
> OK
> 10    122
> Time taken: 23.479 seconds, Fetched: 1 row(s)
> {code}
> I tried it on hive-1.3.0 (MR) and hive-0.13.1 (MR and Tez) - all combinations 
> have the issue



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (HIVE-10841) [WHERE col is not null] does not work sometimes for queries with many JOIN statements

Reply via email to