[
https://issues.apache.org/jira/browse/HIVE-10841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14571849#comment-14571849
]
Alexander Pivovarov commented on HIVE-10841:
--------------------------------------------
LOG info for the queries with different JOIN operators order
L, LA, FR, A, PI, acct - only 2 log messages from ppd.OpProcFactory contain "=
120"
{code}
explain
select
acct.ACC_N,
acct.brn
FROM L
JOIN LA ON L.id = LA.loan_id
JOIN FR ON L.id = FR.loan_id
JOIN A ON LA.aid = A.id
JOIN PI ON PI.id = LA.pi_id
JOIN acct ON A.id = acct.aid
WHERE
L.id = 4436
and acct.brn = 120;
15/06/03 16:31:47 [main]: INFO ppd.OpProcFactory: Processing for FIL(25)
15/06/03 16:31:47 [main]: INFO ppd.OpProcFactory: Pushdown Predicates of FIL
For Alias : acct
15/06/03 16:31:47 [main]: INFO ppd.OpProcFactory: (_col20 = 120)
15/06/03 16:31:47 [main]: INFO ppd.OpProcFactory: Processing for JOIN(24)
15/06/03 16:31:47 [main]: INFO ppd.OpProcFactory: Pushdown Predicates of JOIN
For Alias : acct
15/06/03 16:31:47 [main]: INFO ppd.OpProcFactory: (VALUE._col19 = 120)
Stage: Stage-9
Map Reduce
Select Operator
expressions: _col19 (type: int), 120 (type: int)
{code}
L, LA, FR, A, acct, PI - 8 log lines from ppd.OpProcFactory contain "= 120"
{code}
explain
select
acct.ACC_N,
acct.brn
FROM L
JOIN LA ON L.id = LA.loan_id
JOIN FR ON L.id = FR.loan_id
JOIN A ON LA.aid = A.id
JOIN acct ON A.id = acct.aid
JOIN PI ON PI.id = LA.pi_id
WHERE
L.id = 4436
and acct.brn = 120;
15/06/03 15:45:25 [main]: INFO ppd.OpProcFactory: Processing for FIL(25)
15/06/03 15:45:39 [main]: INFO ppd.OpProcFactory: Pushdown Predicates of FIL
For Alias : acct
15/06/03 15:45:39 [main]: INFO ppd.OpProcFactory: (_col20 = 120)
15/06/03 15:46:23 [main]: INFO ppd.OpProcFactory: Processing for JOIN(24)
15/06/03 15:46:23 [main]: INFO ppd.OpProcFactory: Pushdown Predicates of JOIN
For Alias : acct
15/06/03 15:46:23 [main]: INFO ppd.OpProcFactory: (VALUE._col19 = 120)
15/06/03 15:46:26 [main]: INFO ppd.OpProcFactory: Processing for RS(21)
15/06/03 15:46:26 [main]: INFO ppd.OpProcFactory: Pushdown Predicates of RS For
Alias : acct
15/06/03 15:46:26 [main]: INFO ppd.OpProcFactory: (_col20 = 120)
15/06/03 15:46:43 [main]: INFO ppd.OpProcFactory: Processing for FIL(20)
15/06/03 15:46:49 [main]: INFO ppd.OpProcFactory: Pushdown Predicates of FIL
For Alias : acct
15/06/03 15:46:49 [main]: INFO ppd.OpProcFactory: (_col20 = 120)
15/06/03 15:46:52 [main]: INFO ppd.OpProcFactory: Processing for JOIN(19)
15/06/03 15:46:52 [main]: INFO ppd.OpProcFactory: Pushdown Predicates of JOIN
For Alias : acct
15/06/03 15:46:52 [main]: INFO ppd.OpProcFactory: (VALUE._col1 = 120)
15/06/03 15:59:18 [main]: INFO ppd.OpProcFactory: Processing for RS(18)
15/06/03 15:59:18 [main]: INFO ppd.OpProcFactory: Pushdown Predicates of RS For
Alias : acct
15/06/03 15:59:18 [main]: INFO ppd.OpProcFactory: (brn = 120)
15/06/03 15:59:19 [main]: INFO ppd.OpProcFactory: Processing for FIL(17)
15/06/03 15:59:50 [main]: INFO ppd.OpProcFactory: Pushdown Predicates of FIL
For Alias : acct
15/06/03 15:59:50 [main]: INFO ppd.OpProcFactory: (brn = 120)
15/06/03 16:00:20 [main]: INFO ppd.OpProcFactory: Processing for TS(4)
15/06/03 16:00:20 [main]: INFO ppd.OpProcFactory: Pushdown Predicates of TS For
Alias : acct
15/06/03 16:00:20 [main]: INFO ppd.OpProcFactory: aid is not null
15/06/03 16:00:20 [main]: INFO ppd.OpProcFactory: (brn = 120)
15/06/03 16:01:38 [main]: INFO optimizer.ConstantPropagateProcFactory: expr
Const int 120 fold from Column[VALUE._col19] is removed.
15/06/03 16:01:38 [main]: INFO optimizer.ColumnPrunerProcFactory: RS 21
oldColExprMap: {VALUE._col5=Column[_col5], VALUE._col4=Const int 4436,
VALUE._col3=Column[_col3], VALUE._col2=Column[_col2],
VALUE._col1=Column[_col1], VALUE._col0=Const int 4436,
KEY.reducesinkkey0=Column[_col6], VALUE._col14=Column[_col15],
VALUE._col13=Column[_col14], VALUE._col16=Column[_col17],
VALUE._col15=Column[_col16], VALUE._col18=Column[_col19], VALUE._col9=Const int
4436, VALUE._col17=Column[_col18], VALUE._col8=Column[_col9],
VALUE._col7=Column[_col8], VALUE._col19=Const int 120,
VALUE._col6=Column[_col7], VALUE._col20=Column[_col21],
VALUE._col11=Column[_col12], VALUE._col21=Column[_col22],
VALUE._col12=Column[_col13], VALUE._col22=Column[_col23],
VALUE._col10=Column[_col11]}
15/06/03 16:01:38 [main]: INFO optimizer.ColumnPrunerProcFactory: RS 18
oldColExprMap: {VALUE._col4=Column[ROW__ID],
VALUE._col3=Column[INPUT__FILE__NAME],
VALUE._col2=Column[BLOCK__OFFSET__INSIDE__FILE], VALUE._col1=Const int 120,
VALUE._col0=Column[acc_n], KEY.reducesinkkey0=Column[aid]}
STAGE PLANS:
Stage: Stage-12
acct
TableScan
alias: acct
Statistics: Num rows: 5 Data size: 63 Basic stats: COMPLETE Column
stats: NONE
Filter Operator
predicate: (aid is not null and (brn = 120)) (type: boolean
Stage: Stage-9
Map Reduce
Select Operator
expressions: _col19 (type: int), 120 (type: int)
{code}
> [WHERE col is not null] does not work sometimes for queries with many JOIN
> statements
> -------------------------------------------------------------------------------------
>
> Key: HIVE-10841
> URL: https://issues.apache.org/jira/browse/HIVE-10841
> Project: Hive
> Issue Type: Bug
> Components: Query Planning, Query Processor
> Affects Versions: 0.13.0, 0.14.0, 0.13.1, 1.2.0
> Reporter: Alexander Pivovarov
> Assignee: Alexander Pivovarov
> Attachments: HIVE-10841.patch
>
>
> The result from the following SELECT query is 3 rows but it should be 1 row.
> I checked it in MySQL - it returned 1 row.
> To reproduce the issue in Hive
> 1. prepare tables
> {code}
> drop table if exists L;
> drop table if exists LA;
> drop table if exists FR;
> drop table if exists A;
> drop table if exists PI;
> drop table if exists acct;
> create table L as select 4436 id;
> create table LA as select 4436 loan_id, 4748 aid, 4415 pi_id;
> create table FR as select 4436 loan_id;
> create table A as select 4748 id;
> create table PI as select 4415 id;
> create table acct as select 4748 aid, 10 acc_n, 122 brn;
> insert into table acct values(4748, null, null);
> insert into table acct values(4748, null, null);
> {code}
> 2. run SELECT query
> {code}
> select
> acct.ACC_N,
> acct.brn
> FROM L
> JOIN LA ON L.id = LA.loan_id
> JOIN FR ON L.id = FR.loan_id
> JOIN A ON LA.aid = A.id
> JOIN PI ON PI.id = LA.pi_id
> JOIN acct ON A.id = acct.aid
> WHERE
> L.id = 4436
> and acct.brn is not null;
> {code}
> the result is 3 rows
> {code}
> 10 122
> NULL NULL
> NULL NULL
> {code}
> but it should be 1 row
> {code}
> 10 122
> {code}
> 2.1 "explain select ..." output for hive-1.3.0 MR
> {code}
> STAGE DEPENDENCIES:
> Stage-12 is a root stage
> Stage-9 depends on stages: Stage-12
> Stage-0 depends on stages: Stage-9
> STAGE PLANS:
> Stage: Stage-12
> Map Reduce Local Work
> Alias -> Map Local Tables:
> a
> Fetch Operator
> limit: -1
> acct
> Fetch Operator
> limit: -1
> fr
> Fetch Operator
> limit: -1
> l
> Fetch Operator
> limit: -1
> pi
> Fetch Operator
> limit: -1
> Alias -> Map Local Operator Tree:
> a
> TableScan
> alias: a
> Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column
> stats: NONE
> Filter Operator
> predicate: id is not null (type: boolean)
> Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE
> Column stats: NONE
> HashTable Sink Operator
> keys:
> 0 _col5 (type: int)
> 1 id (type: int)
> 2 aid (type: int)
> acct
> TableScan
> alias: acct
> Statistics: Num rows: 3 Data size: 31 Basic stats: COMPLETE
> Column stats: NONE
> Filter Operator
> predicate: aid is not null (type: boolean)
> Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE
> Column stats: NONE
> HashTable Sink Operator
> keys:
> 0 _col5 (type: int)
> 1 id (type: int)
> 2 aid (type: int)
> fr
> TableScan
> alias: fr
> Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column
> stats: NONE
> Filter Operator
> predicate: (loan_id = 4436) (type: boolean)
> Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE
> Column stats: NONE
> HashTable Sink Operator
> keys:
> 0 4436 (type: int)
> 1 4436 (type: int)
> 2 4436 (type: int)
> l
> TableScan
> alias: l
> Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column
> stats: NONE
> Filter Operator
> predicate: (id = 4436) (type: boolean)
> Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE
> Column stats: NONE
> HashTable Sink Operator
> keys:
> 0 4436 (type: int)
> 1 4436 (type: int)
> 2 4436 (type: int)
> pi
> TableScan
> alias: pi
> Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column
> stats: NONE
> Filter Operator
> predicate: id is not null (type: boolean)
> Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE
> Column stats: NONE
> HashTable Sink Operator
> keys:
> 0 _col6 (type: int)
> 1 id (type: int)
> Stage: Stage-9
> Map Reduce
> Map Operator Tree:
> TableScan
> alias: la
> Statistics: Num rows: 1 Data size: 14 Basic stats: COMPLETE
> Column stats: NONE
> Filter Operator
> predicate: (((loan_id is not null and aid is not null) and
> pi_id is not null) and (loan_id = 4436)) (type: boolean)
> Statistics: Num rows: 1 Data size: 14 Basic stats: COMPLETE
> Column stats: NONE
> Map Join Operator
> condition map:
> Inner Join 0 to 1
> Inner Join 0 to 2
> keys:
> 0 4436 (type: int)
> 1 4436 (type: int)
> 2 4436 (type: int)
> outputColumnNames: _col5, _col6
> Statistics: Num rows: 2 Data size: 8 Basic stats: COMPLETE
> Column stats: NONE
> Map Join Operator
> condition map:
> Inner Join 0 to 1
> Inner Join 1 to 2
> keys:
> 0 _col5 (type: int)
> 1 id (type: int)
> 2 aid (type: int)
> outputColumnNames: _col6, _col19, _col20
> Statistics: Num rows: 4 Data size: 17 Basic stats: COMPLETE
> Column stats: NONE
> Map Join Operator
> condition map:
> Inner Join 0 to 1
> keys:
> 0 _col6 (type: int)
> 1 id (type: int)
> outputColumnNames: _col19, _col20
> Statistics: Num rows: 4 Data size: 18 Basic stats:
> COMPLETE Column stats: NONE
> Select Operator
> expressions: _col19 (type: int), _col20 (type: int)
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 4 Data size: 18 Basic stats:
> COMPLETE Column stats: NONE
> File Output Operator
> compressed: false
> Statistics: Num rows: 4 Data size: 18 Basic stats:
> COMPLETE Column stats: NONE
> table:
> input format:
> org.apache.hadoop.mapred.TextInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> serde:
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> Local Work:
> Map Reduce Local Work
> Stage: Stage-0
> Fetch Operator
> limit: -1
> Processor Tree:
> ListSink
> Time taken: 0.57 seconds, Fetched: 142 row(s)
> {code}
> 2.2. "explain select..." output for hive-0.13.1 Tez
> {code}
> STAGE DEPENDENCIES:
> Stage-1 is a root stage
> Stage-0 is a root stage
> STAGE PLANS:
> Stage: Stage-1
> Tez
> Edges:
> Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 4 (SIMPLE_EDGE), Reducer 6
> (SIMPLE_EDGE)
> Reducer 3 <- Reducer 2 (SIMPLE_EDGE), Map 9 (SIMPLE_EDGE)
> Reducer 6 <- Map 5 (SIMPLE_EDGE), Map 7 (SIMPLE_EDGE), Map 8
> (SIMPLE_EDGE)
> DagName: lcapp_20150528111717_06c57a5b-8dc6-4ce9-bce7-b9e0a7818fe4:1
> Vertices:
> Map 1
> Map Operator Tree:
> TableScan
> alias: acct
> Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE
> Column stats: NONE
> Reduce Output Operator
> key expressions: aid (type: int)
> sort order: +
> Map-reduce partition columns: aid (type: int)
> Statistics: Num rows: 1 Data size: 4 Basic stats:
> COMPLETE Column stats: NONE
> value expressions: acc_n (type: int), brn (type: int)
> Map 4
> Map Operator Tree:
> TableScan
> alias: a
> Statistics: Num rows: 46 Data size: 187 Basic stats:
> COMPLETE Column stats: NONE
> Reduce Output Operator
> key expressions: id (type: int)
> sort order: +
> Map-reduce partition columns: id (type: int)
> Statistics: Num rows: 46 Data size: 187 Basic stats:
> COMPLETE Column stats: NONE
> Map 5
> Map Operator Tree:
> TableScan
> alias: la
> Statistics: Num rows: 28 Data size: 347 Basic stats:
> COMPLETE Column stats: NONE
> Filter Operator
> predicate: (loan_id = 4436) (type: boolean)
> Statistics: Num rows: 14 Data size: 173 Basic stats:
> COMPLETE Column stats: NONE
> Reduce Output Operator
> key expressions: loan_id (type: int)
> sort order: +
> Map-reduce partition columns: loan_id (type: int)
> Statistics: Num rows: 14 Data size: 173 Basic stats:
> COMPLETE Column stats: NONE
> value expressions: aid (type: int), pi_id (type: int)
> Map 7
> Map Operator Tree:
> TableScan
> alias: fr
> Statistics: Num rows: 46 Data size: 187 Basic stats:
> COMPLETE Column stats: NONE
> Filter Operator
> predicate: (loan_id = 4436) (type: boolean)
> Statistics: Num rows: 23 Data size: 93 Basic stats:
> COMPLETE Column stats: NONE
> Reduce Output Operator
> key expressions: loan_id (type: int)
> sort order: +
> Map-reduce partition columns: loan_id (type: int)
> Statistics: Num rows: 23 Data size: 93 Basic stats:
> COMPLETE Column stats: NONE
> Map 8
> Map Operator Tree:
> TableScan
> alias: l
> Statistics: Num rows: 46 Data size: 187 Basic stats:
> COMPLETE Column stats: NONE
> Filter Operator
> predicate: (id = 4436) (type: boolean)
> Statistics: Num rows: 23 Data size: 93 Basic stats:
> COMPLETE Column stats: NONE
> Reduce Output Operator
> key expressions: id (type: int)
> sort order: +
> Map-reduce partition columns: id (type: int)
> Statistics: Num rows: 23 Data size: 93 Basic stats:
> COMPLETE Column stats: NONE
> Map 9
> Map Operator Tree:
> TableScan
> alias: pi
> Statistics: Num rows: 46 Data size: 187 Basic stats:
> COMPLETE Column stats: NONE
> Reduce Output Operator
> key expressions: id (type: int)
> sort order: +
> Map-reduce partition columns: id (type: int)
> Statistics: Num rows: 46 Data size: 187 Basic stats:
> COMPLETE Column stats: NONE
> Reducer 2
> Reduce Operator Tree:
> Join Operator
> condition map:
> Inner Join 0 to 1
> Inner Join 1 to 2
> condition expressions:
> 0 {VALUE._col2}
> 1
> 2 {VALUE._col1} {VALUE._col2}
> outputColumnNames: _col2, _col15, _col16
> Statistics: Num rows: 110 Data size: 448 Basic stats:
> COMPLETE Column stats: NONE
> Reduce Output Operator
> key expressions: _col2 (type: int)
> sort order: +
> Map-reduce partition columns: _col2 (type: int)
> Statistics: Num rows: 110 Data size: 448 Basic stats:
> COMPLETE Column stats: NONE
> value expressions: _col15 (type: int), _col16 (type: int)
> Reducer 3
> Reduce Operator Tree:
> Join Operator
> condition map:
> Inner Join 0 to 1
> condition expressions:
> 0 {VALUE._col1} {VALUE._col2}
> 1
> outputColumnNames: _col1, _col2
> Statistics: Num rows: 121 Data size: 492 Basic stats:
> COMPLETE Column stats: NONE
> Select Operator
> expressions: _col1 (type: int), _col2 (type: int)
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 121 Data size: 492 Basic stats:
> COMPLETE Column stats: NONE
> File Output Operator
> compressed: false
> Statistics: Num rows: 121 Data size: 492 Basic stats:
> COMPLETE Column stats: NONE
> table:
> input format: org.apache.hadoop.mapred.TextInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> serde:
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> Reducer 6
> Reduce Operator Tree:
> Join Operator
> condition map:
> Inner Join 0 to 1
> Inner Join 0 to 2
> condition expressions:
> 0
> 1 {VALUE._col1} {VALUE._col2}
> 2
> outputColumnNames: _col4, _col5
> Statistics: Num rows: 50 Data size: 204 Basic stats: COMPLETE
> Column stats: NONE
> Reduce Output Operator
> key expressions: _col4 (type: int)
> sort order: +
> Map-reduce partition columns: _col4 (type: int)
> Statistics: Num rows: 50 Data size: 204 Basic stats:
> COMPLETE Column stats: NONE
> value expressions: _col5 (type: int)
> Stage: Stage-0
> Fetch Operator
> limit: -1
> Time taken: 1.377 seconds, Fetched: 146 row(s)
> {code}
> 3. The workaround is to put "acct.brn is not null" to join condition
> {code}
> select
> acct.ACC_N,
> acct.brn
> FROM L
> JOIN LA ON L.id = LA.loan_id
> JOIN FR ON L.id = FR.loan_id
> JOIN A ON LA.aid = A.id
> JOIN PI ON PI.id = LA.pi_id
> JOIN acct ON A.id = acct.aid and acct.brn is not null
> WHERE
> L.id = 4436;
> OK
> 10 122
> Time taken: 23.479 seconds, Fetched: 1 row(s)
> {code}
> I tried it on hive-1.3.0 (MR) and hive-0.13.1 (MR and Tez) - all combinations
> have the issue
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)