[jira] [Updated] (HIVE-10841) [WHERE col is not null] does not work sometimes for queries with many JOIN statements

Alexander Pivovarov (JIRA) Thu, 28 May 2015 11:17:08 -0700

     [ 
https://issues.apache.org/jira/browse/HIVE-10841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Alexander Pivovarov updated HIVE-10841:
---------------------------------------
    Description: 
The result from the following SELECT query is 3 rows but it should be 1 row.
I checked it in MySQL - it returned 1 row.

To reproduce the issue in Hive
1. prepare tables
{code}
drop table if exists L;
drop table if exists LA;
drop table if exists FR;
drop table if exists A;
drop table if exists PI;
drop table if exists acct;

create table L as select 4436 id;
create table LA as select 4436 loan_id, 4748 aid, 4415 pi_id;
create table FR as select 4436 loan_id;
create table A as select 4748 id;
create table PI as select 4415 id;

create table acct as select 4748 aid, 10 acc_n, 122 brn;
insert into table acct values(4748, null, null);
insert into table acct values(4748, null, null);
{code}

2. run SELECT query
{code}
select
  acct.ACC_N,
  acct.brn
FROM L
JOIN LA ON L.id = LA.loan_id
JOIN FR ON L.id = FR.loan_id
JOIN A ON LA.aid = A.id
JOIN PI ON PI.id = LA.pi_id
JOIN acct ON A.id = acct.aid
WHERE
  L.id = 4436
  and acct.brn is not null;
{code}

the result is 3 rows
{code}
10      122
NULL    NULL
NULL    NULL
{code}

but it should be 1 row

{code}
10      122
{code}

2.1 "explain select ..." output
{code}
STAGE DEPENDENCIES:
  Stage-12 is a root stage
  Stage-9 depends on stages: Stage-12
  Stage-0 depends on stages: Stage-9

STAGE PLANS:
  Stage: Stage-12
    Map Reduce Local Work
      Alias -> Map Local Tables:
        a 
          Fetch Operator
            limit: -1
        acct 
          Fetch Operator
            limit: -1
        fr 
          Fetch Operator
            limit: -1
        l 
          Fetch Operator
            limit: -1
        pi 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        a 
          TableScan
            alias: a
            Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column 
stats: NONE
            Filter Operator
              predicate: id is not null (type: boolean)
              Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column 
stats: NONE
              HashTable Sink Operator
                keys:
                  0 _col5 (type: int)
                  1 id (type: int)
                  2 aid (type: int)
        acct 
          TableScan
            alias: acct
            Statistics: Num rows: 3 Data size: 31 Basic stats: COMPLETE Column 
stats: NONE
            Filter Operator
              predicate: aid is not null (type: boolean)
              Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE 
Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 _col5 (type: int)
                  1 id (type: int)
                  2 aid (type: int)
        fr 
          TableScan
            alias: fr
            Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column 
stats: NONE
            Filter Operator
              predicate: (loan_id = 4436) (type: boolean)
              Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column 
stats: NONE
              HashTable Sink Operator
                keys:
                  0 4436 (type: int)
                  1 4436 (type: int)
                  2 4436 (type: int)
        l 
          TableScan
            alias: l
            Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column 
stats: NONE
            Filter Operator
              predicate: (id = 4436) (type: boolean)
              Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column 
stats: NONE
              HashTable Sink Operator
                keys:
                  0 4436 (type: int)
                  1 4436 (type: int)
                  2 4436 (type: int)
        pi 
          TableScan
            alias: pi
            Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column 
stats: NONE
            Filter Operator
              predicate: id is not null (type: boolean)
              Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column 
stats: NONE
              HashTable Sink Operator
                keys:
                  0 _col6 (type: int)
                  1 id (type: int)

  Stage: Stage-9
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: la
            Statistics: Num rows: 1 Data size: 14 Basic stats: COMPLETE Column 
stats: NONE
            Filter Operator
              predicate: (((loan_id is not null and aid is not null) and pi_id 
is not null) and (loan_id = 4436)) (type: boolean)
              Statistics: Num rows: 1 Data size: 14 Basic stats: COMPLETE 
Column stats: NONE
              Map Join Operator
                condition map:
                     Inner Join 0 to 1
                     Inner Join 0 to 2
                keys:
                  0 4436 (type: int)
                  1 4436 (type: int)
                  2 4436 (type: int)
                outputColumnNames: _col5, _col6
                Statistics: Num rows: 2 Data size: 8 Basic stats: COMPLETE 
Column stats: NONE
                Map Join Operator
                  condition map:
                       Inner Join 0 to 1
                       Inner Join 1 to 2
                  keys:
                    0 _col5 (type: int)
                    1 id (type: int)
                    2 aid (type: int)
                  outputColumnNames: _col6, _col19, _col20
                  Statistics: Num rows: 4 Data size: 17 Basic stats: COMPLETE 
Column stats: NONE
                  Map Join Operator
                    condition map:
                         Inner Join 0 to 1
                    keys:
                      0 _col6 (type: int)
                      1 id (type: int)
                    outputColumnNames: _col19, _col20
                    Statistics: Num rows: 4 Data size: 18 Basic stats: COMPLETE 
Column stats: NONE
                    Select Operator
                      expressions: _col19 (type: int), _col20 (type: int)
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 4 Data size: 18 Basic stats: 
COMPLETE Column stats: NONE
                      File Output Operator
                        compressed: false
                        Statistics: Num rows: 4 Data size: 18 Basic stats: 
COMPLETE Column stats: NONE
                        table:
                            input format: 
org.apache.hadoop.mapred.TextInputFormat
                            output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                            serde: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 0.57 seconds, Fetched: 142 row(s)
{code}

3. The workaround is to put "acct.brn is not null" to join condition
{code}
select
  acct.ACC_N,
  acct.brn
FROM L
JOIN LA ON L.id = LA.loan_id
JOIN FR ON L.id = FR.loan_id
JOIN A ON LA.aid = A.id
JOIN PI ON PI.id = LA.pi_id
JOIN acct ON A.id = acct.aid and acct.brn is not null
WHERE
  L.id = 4436;

OK
10      122
Time taken: 23.479 seconds, Fetched: 1 row(s)
{code}

I tried it on hive-1.3.0 (MR) and hive-0.13.1 (MR and Tez) - all combinations 
have the issue

  was:
The result from the following SELECT query is 3 rows but it should be 1 row.
I checked it in MySQL - it returned 1 row.

To reproduce the issue in Hive
1. prepare tables
{code}
drop table if exists L;
drop table if exists LA;
drop table if exists FR;
drop table if exists A;
drop table if exists PI;
drop table if exists acct;

create table L as select 4436 id;
create table LA as select 4436 loan_id, 4748 aid, 4415 pi_id;
create table FR as select 4436 loan_id;
create table A as select 4748 id;
create table PI as select 4415 id;

create table acct as select 4748 aid, 10 acc_n, 122 brn;
insert into table acct values(4748, null, null);
insert into table acct values(4748, null, null);
{code}

2. run SELECT query
{code}
select
  acct.ACC_N,
  acct.brn
FROM L
JOIN LA ON L.id = LA.loan_id
JOIN FR ON L.id = FR.loan_id
JOIN A ON LA.aid = A.id
JOIN PI ON PI.id = LA.pi_id
JOIN acct ON A.id = acct.aid
WHERE
  L.id = 4436
  and acct.brn is not null;
{code}

the result is 3 rows
{code}
10      122
NULL    NULL
NULL    NULL
{code}

but it should be 1 row

{code}
10      122
{code}

3. The workaround is to put "acct.brn is not null" to join condition
{code}
select
  acct.ACC_N,
  acct.brn
FROM L
JOIN LA ON L.id = LA.loan_id
JOIN FR ON L.id = FR.loan_id
JOIN A ON LA.aid = A.id
JOIN PI ON PI.id = LA.pi_id
JOIN acct ON A.id = acct.aid and acct.brn is not null
WHERE
  L.id = 4436;

OK
10      122
Time taken: 23.479 seconds, Fetched: 1 row(s)
{code}

I tried it on hive-1.3.0 (MR) and hive-0.13.1 (MR and Tez) - all combinations 
have the issue


> [WHERE col is not null] does not work sometimes for queries with many JOIN 
> statements
> -------------------------------------------------------------------------------------
>
>                 Key: HIVE-10841
>                 URL: https://issues.apache.org/jira/browse/HIVE-10841
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Alexander Pivovarov
>
> The result from the following SELECT query is 3 rows but it should be 1 row.
> I checked it in MySQL - it returned 1 row.
> To reproduce the issue in Hive
> 1. prepare tables
> {code}
> drop table if exists L;
> drop table if exists LA;
> drop table if exists FR;
> drop table if exists A;
> drop table if exists PI;
> drop table if exists acct;
> create table L as select 4436 id;
> create table LA as select 4436 loan_id, 4748 aid, 4415 pi_id;
> create table FR as select 4436 loan_id;
> create table A as select 4748 id;
> create table PI as select 4415 id;
> create table acct as select 4748 aid, 10 acc_n, 122 brn;
> insert into table acct values(4748, null, null);
> insert into table acct values(4748, null, null);
> {code}
> 2. run SELECT query
> {code}
> select
>   acct.ACC_N,
>   acct.brn
> FROM L
> JOIN LA ON L.id = LA.loan_id
> JOIN FR ON L.id = FR.loan_id
> JOIN A ON LA.aid = A.id
> JOIN PI ON PI.id = LA.pi_id
> JOIN acct ON A.id = acct.aid
> WHERE
>   L.id = 4436
>   and acct.brn is not null;
> {code}
> the result is 3 rows
> {code}
> 10    122
> NULL  NULL
> NULL  NULL
> {code}
> but it should be 1 row
> {code}
> 10    122
> {code}
> 2.1 "explain select ..." output
> {code}
> STAGE DEPENDENCIES:
>   Stage-12 is a root stage
>   Stage-9 depends on stages: Stage-12
>   Stage-0 depends on stages: Stage-9
> STAGE PLANS:
>   Stage: Stage-12
>     Map Reduce Local Work
>       Alias -> Map Local Tables:
>         a 
>           Fetch Operator
>             limit: -1
>         acct 
>           Fetch Operator
>             limit: -1
>         fr 
>           Fetch Operator
>             limit: -1
>         l 
>           Fetch Operator
>             limit: -1
>         pi 
>           Fetch Operator
>             limit: -1
>       Alias -> Map Local Operator Tree:
>         a 
>           TableScan
>             alias: a
>             Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column 
> stats: NONE
>             Filter Operator
>               predicate: id is not null (type: boolean)
>               Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE 
> Column stats: NONE
>               HashTable Sink Operator
>                 keys:
>                   0 _col5 (type: int)
>                   1 id (type: int)
>                   2 aid (type: int)
>         acct 
>           TableScan
>             alias: acct
>             Statistics: Num rows: 3 Data size: 31 Basic stats: COMPLETE 
> Column stats: NONE
>             Filter Operator
>               predicate: aid is not null (type: boolean)
>               Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE 
> Column stats: NONE
>               HashTable Sink Operator
>                 keys:
>                   0 _col5 (type: int)
>                   1 id (type: int)
>                   2 aid (type: int)
>         fr 
>           TableScan
>             alias: fr
>             Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column 
> stats: NONE
>             Filter Operator
>               predicate: (loan_id = 4436) (type: boolean)
>               Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE 
> Column stats: NONE
>               HashTable Sink Operator
>                 keys:
>                   0 4436 (type: int)
>                   1 4436 (type: int)
>                   2 4436 (type: int)
>         l 
>           TableScan
>             alias: l
>             Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column 
> stats: NONE
>             Filter Operator
>               predicate: (id = 4436) (type: boolean)
>               Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE 
> Column stats: NONE
>               HashTable Sink Operator
>                 keys:
>                   0 4436 (type: int)
>                   1 4436 (type: int)
>                   2 4436 (type: int)
>         pi 
>           TableScan
>             alias: pi
>             Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column 
> stats: NONE
>             Filter Operator
>               predicate: id is not null (type: boolean)
>               Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE 
> Column stats: NONE
>               HashTable Sink Operator
>                 keys:
>                   0 _col6 (type: int)
>                   1 id (type: int)
>   Stage: Stage-9
>     Map Reduce
>       Map Operator Tree:
>           TableScan
>             alias: la
>             Statistics: Num rows: 1 Data size: 14 Basic stats: COMPLETE 
> Column stats: NONE
>             Filter Operator
>               predicate: (((loan_id is not null and aid is not null) and 
> pi_id is not null) and (loan_id = 4436)) (type: boolean)
>               Statistics: Num rows: 1 Data size: 14 Basic stats: COMPLETE 
> Column stats: NONE
>               Map Join Operator
>                 condition map:
>                      Inner Join 0 to 1
>                      Inner Join 0 to 2
>                 keys:
>                   0 4436 (type: int)
>                   1 4436 (type: int)
>                   2 4436 (type: int)
>                 outputColumnNames: _col5, _col6
>                 Statistics: Num rows: 2 Data size: 8 Basic stats: COMPLETE 
> Column stats: NONE
>                 Map Join Operator
>                   condition map:
>                        Inner Join 0 to 1
>                        Inner Join 1 to 2
>                   keys:
>                     0 _col5 (type: int)
>                     1 id (type: int)
>                     2 aid (type: int)
>                   outputColumnNames: _col6, _col19, _col20
>                   Statistics: Num rows: 4 Data size: 17 Basic stats: COMPLETE 
> Column stats: NONE
>                   Map Join Operator
>                     condition map:
>                          Inner Join 0 to 1
>                     keys:
>                       0 _col6 (type: int)
>                       1 id (type: int)
>                     outputColumnNames: _col19, _col20
>                     Statistics: Num rows: 4 Data size: 18 Basic stats: 
> COMPLETE Column stats: NONE
>                     Select Operator
>                       expressions: _col19 (type: int), _col20 (type: int)
>                       outputColumnNames: _col0, _col1
>                       Statistics: Num rows: 4 Data size: 18 Basic stats: 
> COMPLETE Column stats: NONE
>                       File Output Operator
>                         compressed: false
>                         Statistics: Num rows: 4 Data size: 18 Basic stats: 
> COMPLETE Column stats: NONE
>                         table:
>                             input format: 
> org.apache.hadoop.mapred.TextInputFormat
>                             output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                             serde: 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>       Local Work:
>         Map Reduce Local Work
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
>       Processor Tree:
>         ListSink
> Time taken: 0.57 seconds, Fetched: 142 row(s)
> {code}
> 3. The workaround is to put "acct.brn is not null" to join condition
> {code}
> select
>   acct.ACC_N,
>   acct.brn
> FROM L
> JOIN LA ON L.id = LA.loan_id
> JOIN FR ON L.id = FR.loan_id
> JOIN A ON LA.aid = A.id
> JOIN PI ON PI.id = LA.pi_id
> JOIN acct ON A.id = acct.aid and acct.brn is not null
> WHERE
>   L.id = 4436;
> OK
> 10    122
> Time taken: 23.479 seconds, Fetched: 1 row(s)
> {code}
> I tried it on hive-1.3.0 (MR) and hive-0.13.1 (MR and Tez) - all combinations 
> have the issue



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (HIVE-10841) [WHERE col is not null] does not work sometimes for queries with many JOIN statements

Reply via email to