[
https://issues.apache.org/jira/browse/HIVE-10841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexander Pivovarov updated HIVE-10841:
---------------------------------------
Description:
The result from the following SELECT query is 3 rows but it should be 1 row.
I checked it in MySQL - it returned 1 row.
To reproduce the issue in Hive
1. prepare tables
{code}
drop table if exists L;
drop table if exists LA;
drop table if exists FR;
drop table if exists A;
drop table if exists PI;
drop table if exists acct;
create table L as select 4436 id;
create table LA as select 4436 loan_id, 4748 aid, 4415 pi_id;
create table FR as select 4436 loan_id;
create table A as select 4748 id;
create table PI as select 4415 id;
create table acct as select 4748 aid, 10 acc_n, 122 brn;
insert into table acct values(4748, null, null);
insert into table acct values(4748, null, null);
{code}
2. run SELECT query
{code}
select
acct.ACC_N,
acct.brn
FROM L
JOIN LA ON L.id = LA.loan_id
JOIN FR ON L.id = FR.loan_id
JOIN A ON LA.aid = A.id
JOIN PI ON PI.id = LA.pi_id
JOIN acct ON A.id = acct.aid
WHERE
L.id = 4436
and acct.brn is not null;
{code}
the result is 3 rows
{code}
10 122
NULL NULL
NULL NULL
{code}
but it should be 1 row
{code}
10 122
{code}
2.1 "explain select ..." output
{code}
STAGE DEPENDENCIES:
Stage-12 is a root stage
Stage-9 depends on stages: Stage-12
Stage-0 depends on stages: Stage-9
STAGE PLANS:
Stage: Stage-12
Map Reduce Local Work
Alias -> Map Local Tables:
a
Fetch Operator
limit: -1
acct
Fetch Operator
limit: -1
fr
Fetch Operator
limit: -1
l
Fetch Operator
limit: -1
pi
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
a
TableScan
alias: a
Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column
stats: NONE
Filter Operator
predicate: id is not null (type: boolean)
Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column
stats: NONE
HashTable Sink Operator
keys:
0 _col5 (type: int)
1 id (type: int)
2 aid (type: int)
acct
TableScan
alias: acct
Statistics: Num rows: 3 Data size: 31 Basic stats: COMPLETE Column
stats: NONE
Filter Operator
predicate: aid is not null (type: boolean)
Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE
Column stats: NONE
HashTable Sink Operator
keys:
0 _col5 (type: int)
1 id (type: int)
2 aid (type: int)
fr
TableScan
alias: fr
Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column
stats: NONE
Filter Operator
predicate: (loan_id = 4436) (type: boolean)
Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column
stats: NONE
HashTable Sink Operator
keys:
0 4436 (type: int)
1 4436 (type: int)
2 4436 (type: int)
l
TableScan
alias: l
Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column
stats: NONE
Filter Operator
predicate: (id = 4436) (type: boolean)
Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column
stats: NONE
HashTable Sink Operator
keys:
0 4436 (type: int)
1 4436 (type: int)
2 4436 (type: int)
pi
TableScan
alias: pi
Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column
stats: NONE
Filter Operator
predicate: id is not null (type: boolean)
Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column
stats: NONE
HashTable Sink Operator
keys:
0 _col6 (type: int)
1 id (type: int)
Stage: Stage-9
Map Reduce
Map Operator Tree:
TableScan
alias: la
Statistics: Num rows: 1 Data size: 14 Basic stats: COMPLETE Column
stats: NONE
Filter Operator
predicate: (((loan_id is not null and aid is not null) and pi_id
is not null) and (loan_id = 4436)) (type: boolean)
Statistics: Num rows: 1 Data size: 14 Basic stats: COMPLETE
Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
Inner Join 0 to 2
keys:
0 4436 (type: int)
1 4436 (type: int)
2 4436 (type: int)
outputColumnNames: _col5, _col6
Statistics: Num rows: 2 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
Inner Join 1 to 2
keys:
0 _col5 (type: int)
1 id (type: int)
2 aid (type: int)
outputColumnNames: _col6, _col19, _col20
Statistics: Num rows: 4 Data size: 17 Basic stats: COMPLETE
Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
keys:
0 _col6 (type: int)
1 id (type: int)
outputColumnNames: _col19, _col20
Statistics: Num rows: 4 Data size: 18 Basic stats: COMPLETE
Column stats: NONE
Select Operator
expressions: _col19 (type: int), _col20 (type: int)
outputColumnNames: _col0, _col1
Statistics: Num rows: 4 Data size: 18 Basic stats:
COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 4 Data size: 18 Basic stats:
COMPLETE Column stats: NONE
table:
input format:
org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Local Work:
Map Reduce Local Work
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Time taken: 0.57 seconds, Fetched: 142 row(s)
{code}
3. The workaround is to put "acct.brn is not null" to join condition
{code}
select
acct.ACC_N,
acct.brn
FROM L
JOIN LA ON L.id = LA.loan_id
JOIN FR ON L.id = FR.loan_id
JOIN A ON LA.aid = A.id
JOIN PI ON PI.id = LA.pi_id
JOIN acct ON A.id = acct.aid and acct.brn is not null
WHERE
L.id = 4436;
OK
10 122
Time taken: 23.479 seconds, Fetched: 1 row(s)
{code}
I tried it on hive-1.3.0 (MR) and hive-0.13.1 (MR and Tez) - all combinations
have the issue
was:
The result from the following SELECT query is 3 rows but it should be 1 row.
I checked it in MySQL - it returned 1 row.
To reproduce the issue in Hive
1. prepare tables
{code}
drop table if exists L;
drop table if exists LA;
drop table if exists FR;
drop table if exists A;
drop table if exists PI;
drop table if exists acct;
create table L as select 4436 id;
create table LA as select 4436 loan_id, 4748 aid, 4415 pi_id;
create table FR as select 4436 loan_id;
create table A as select 4748 id;
create table PI as select 4415 id;
create table acct as select 4748 aid, 10 acc_n, 122 brn;
insert into table acct values(4748, null, null);
insert into table acct values(4748, null, null);
{code}
2. run SELECT query
{code}
select
acct.ACC_N,
acct.brn
FROM L
JOIN LA ON L.id = LA.loan_id
JOIN FR ON L.id = FR.loan_id
JOIN A ON LA.aid = A.id
JOIN PI ON PI.id = LA.pi_id
JOIN acct ON A.id = acct.aid
WHERE
L.id = 4436
and acct.brn is not null;
{code}
the result is 3 rows
{code}
10 122
NULL NULL
NULL NULL
{code}
but it should be 1 row
{code}
10 122
{code}
3. The workaround is to put "acct.brn is not null" to join condition
{code}
select
acct.ACC_N,
acct.brn
FROM L
JOIN LA ON L.id = LA.loan_id
JOIN FR ON L.id = FR.loan_id
JOIN A ON LA.aid = A.id
JOIN PI ON PI.id = LA.pi_id
JOIN acct ON A.id = acct.aid and acct.brn is not null
WHERE
L.id = 4436;
OK
10 122
Time taken: 23.479 seconds, Fetched: 1 row(s)
{code}
I tried it on hive-1.3.0 (MR) and hive-0.13.1 (MR and Tez) - all combinations
have the issue
> [WHERE col is not null] does not work sometimes for queries with many JOIN
> statements
> -------------------------------------------------------------------------------------
>
> Key: HIVE-10841
> URL: https://issues.apache.org/jira/browse/HIVE-10841
> Project: Hive
> Issue Type: Bug
> Components: Query Processor
> Reporter: Alexander Pivovarov
>
> The result from the following SELECT query is 3 rows but it should be 1 row.
> I checked it in MySQL - it returned 1 row.
> To reproduce the issue in Hive
> 1. prepare tables
> {code}
> drop table if exists L;
> drop table if exists LA;
> drop table if exists FR;
> drop table if exists A;
> drop table if exists PI;
> drop table if exists acct;
> create table L as select 4436 id;
> create table LA as select 4436 loan_id, 4748 aid, 4415 pi_id;
> create table FR as select 4436 loan_id;
> create table A as select 4748 id;
> create table PI as select 4415 id;
> create table acct as select 4748 aid, 10 acc_n, 122 brn;
> insert into table acct values(4748, null, null);
> insert into table acct values(4748, null, null);
> {code}
> 2. run SELECT query
> {code}
> select
> acct.ACC_N,
> acct.brn
> FROM L
> JOIN LA ON L.id = LA.loan_id
> JOIN FR ON L.id = FR.loan_id
> JOIN A ON LA.aid = A.id
> JOIN PI ON PI.id = LA.pi_id
> JOIN acct ON A.id = acct.aid
> WHERE
> L.id = 4436
> and acct.brn is not null;
> {code}
> the result is 3 rows
> {code}
> 10 122
> NULL NULL
> NULL NULL
> {code}
> but it should be 1 row
> {code}
> 10 122
> {code}
> 2.1 "explain select ..." output
> {code}
> STAGE DEPENDENCIES:
> Stage-12 is a root stage
> Stage-9 depends on stages: Stage-12
> Stage-0 depends on stages: Stage-9
> STAGE PLANS:
> Stage: Stage-12
> Map Reduce Local Work
> Alias -> Map Local Tables:
> a
> Fetch Operator
> limit: -1
> acct
> Fetch Operator
> limit: -1
> fr
> Fetch Operator
> limit: -1
> l
> Fetch Operator
> limit: -1
> pi
> Fetch Operator
> limit: -1
> Alias -> Map Local Operator Tree:
> a
> TableScan
> alias: a
> Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column
> stats: NONE
> Filter Operator
> predicate: id is not null (type: boolean)
> Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE
> Column stats: NONE
> HashTable Sink Operator
> keys:
> 0 _col5 (type: int)
> 1 id (type: int)
> 2 aid (type: int)
> acct
> TableScan
> alias: acct
> Statistics: Num rows: 3 Data size: 31 Basic stats: COMPLETE
> Column stats: NONE
> Filter Operator
> predicate: aid is not null (type: boolean)
> Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE
> Column stats: NONE
> HashTable Sink Operator
> keys:
> 0 _col5 (type: int)
> 1 id (type: int)
> 2 aid (type: int)
> fr
> TableScan
> alias: fr
> Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column
> stats: NONE
> Filter Operator
> predicate: (loan_id = 4436) (type: boolean)
> Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE
> Column stats: NONE
> HashTable Sink Operator
> keys:
> 0 4436 (type: int)
> 1 4436 (type: int)
> 2 4436 (type: int)
> l
> TableScan
> alias: l
> Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column
> stats: NONE
> Filter Operator
> predicate: (id = 4436) (type: boolean)
> Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE
> Column stats: NONE
> HashTable Sink Operator
> keys:
> 0 4436 (type: int)
> 1 4436 (type: int)
> 2 4436 (type: int)
> pi
> TableScan
> alias: pi
> Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column
> stats: NONE
> Filter Operator
> predicate: id is not null (type: boolean)
> Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE
> Column stats: NONE
> HashTable Sink Operator
> keys:
> 0 _col6 (type: int)
> 1 id (type: int)
> Stage: Stage-9
> Map Reduce
> Map Operator Tree:
> TableScan
> alias: la
> Statistics: Num rows: 1 Data size: 14 Basic stats: COMPLETE
> Column stats: NONE
> Filter Operator
> predicate: (((loan_id is not null and aid is not null) and
> pi_id is not null) and (loan_id = 4436)) (type: boolean)
> Statistics: Num rows: 1 Data size: 14 Basic stats: COMPLETE
> Column stats: NONE
> Map Join Operator
> condition map:
> Inner Join 0 to 1
> Inner Join 0 to 2
> keys:
> 0 4436 (type: int)
> 1 4436 (type: int)
> 2 4436 (type: int)
> outputColumnNames: _col5, _col6
> Statistics: Num rows: 2 Data size: 8 Basic stats: COMPLETE
> Column stats: NONE
> Map Join Operator
> condition map:
> Inner Join 0 to 1
> Inner Join 1 to 2
> keys:
> 0 _col5 (type: int)
> 1 id (type: int)
> 2 aid (type: int)
> outputColumnNames: _col6, _col19, _col20
> Statistics: Num rows: 4 Data size: 17 Basic stats: COMPLETE
> Column stats: NONE
> Map Join Operator
> condition map:
> Inner Join 0 to 1
> keys:
> 0 _col6 (type: int)
> 1 id (type: int)
> outputColumnNames: _col19, _col20
> Statistics: Num rows: 4 Data size: 18 Basic stats:
> COMPLETE Column stats: NONE
> Select Operator
> expressions: _col19 (type: int), _col20 (type: int)
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 4 Data size: 18 Basic stats:
> COMPLETE Column stats: NONE
> File Output Operator
> compressed: false
> Statistics: Num rows: 4 Data size: 18 Basic stats:
> COMPLETE Column stats: NONE
> table:
> input format:
> org.apache.hadoop.mapred.TextInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> serde:
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> Local Work:
> Map Reduce Local Work
> Stage: Stage-0
> Fetch Operator
> limit: -1
> Processor Tree:
> ListSink
> Time taken: 0.57 seconds, Fetched: 142 row(s)
> {code}
> 3. The workaround is to put "acct.brn is not null" to join condition
> {code}
> select
> acct.ACC_N,
> acct.brn
> FROM L
> JOIN LA ON L.id = LA.loan_id
> JOIN FR ON L.id = FR.loan_id
> JOIN A ON LA.aid = A.id
> JOIN PI ON PI.id = LA.pi_id
> JOIN acct ON A.id = acct.aid and acct.brn is not null
> WHERE
> L.id = 4436;
> OK
> 10 122
> Time taken: 23.479 seconds, Fetched: 1 row(s)
> {code}
> I tried it on hive-1.3.0 (MR) and hive-0.13.1 (MR and Tez) - all combinations
> have the issue
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)