[
https://issues.apache.org/jira/browse/HIVE-12720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yongzhi Chen updated HIVE-12720:
--------------------------------
Description:
In certain cases hive is not pruning partitions as well as it could be, and
instead scanning the whole table.
Minimal use case:
{code:sql}
create table tbl1 (
id1 int,
ext_id2 int
)
PARTITIONED BY (ymd int);
create table tbl2 (
id2 int,
prop string
)
PARTITIONED BY (ymd int);
INSERT INTO tbl1 PARTITION(ymd=20151201) VALUES (1,1);
INSERT INTO tbl1 PARTITION(ymd=20151202) VALUES (1,2);
INSERT INTO tbl1 PARTITION(ymd=20151203) VALUES (1,3);
INSERT INTO tbl2 PARTITION(ymd=20151201) VALUES (1,'testa');
INSERT INTO tbl2 PARTITION(ymd=20151202) VALUES (2,'testb');
INSERT INTO tbl2 PARTITION(ymd=20151203) VALUES (3,'testc');
-- query 1
EXPLAIN
select * from
tbl1
LEFT OUTER JOIN
tbl2
ON
(tbl1.ext_id2 = tbl2.id2
and tbl2.ymd = tbl1.ymd)
WHERE
tbl1.ymd = 20151203
-- query2
EXPLAIN
with cte as (select * from tbl1
WHERE
tbl1.ymd = 20151203)
select * from
cte
LEFT OUTER JOIN
tbl2
ON
(cte.ext_id2 = tbl2.id2
and tbl2.ymd = cte.ymd)
WHERE
tbl2.ymd = 20151203
{code}
Query1 prunes correctly:
alias: tbl2
Statistics: Num rows: 1 Data size: 7 Basic stats: COMPLETE Column stats: NONE
Query2 does a full table scan on tbl2:
Statistics: Num rows: 4 Data size: 28 Basic stats: COMPLETE Column stats: NONE
I know both queries are not equivalent and the second query is not really a
LEFT JOIN in its current state, but I think the optimiser could do a better job
here.
This example was (re)produced with a cdh 5.5 docker image.
This issue can be reproduced when:
set hive.cbo.enable=false;
was:
In certain cases hive is not pruning partitions as well as it could be, and
instead scanning the whole table.
Minimal use case:
{code:sql}
create table tbl1 (
id1 int,
ext_id2 int
)
PARTITIONED BY (ymd int);
create table tbl2 (
id2 int,
prop string
)
PARTITIONED BY (ymd int);
INSERT INTO tbl1 PARTITION(ymd=20151201) VALUES (1,1);
INSERT INTO tbl1 PARTITION(ymd=20151202) VALUES (1,2);
INSERT INTO tbl1 PARTITION(ymd=20151203) VALUES (1,3);
INSERT INTO tbl2 PARTITION(ymd=20151201) VALUES (1,'testa');
INSERT INTO tbl2 PARTITION(ymd=20151202) VALUES (2,'testb');
INSERT INTO tbl2 PARTITION(ymd=20151203) VALUES (3,'testc');
-- query 1
EXPLAIN
select * from
tbl1
LEFT OUTER JOIN
tbl2
ON
(tbl1.ext_id2 = tbl2.id2
and tbl2.ymd = tbl1.ymd)
WHERE
tbl1.ymd = 20151203
-- query2
EXPLAIN
with cte as (select * from tbl1
WHERE
tbl1.ymd = 20151203)
select * from
cte
LEFT OUTER JOIN
tbl2
ON
(cte.ext_id2 = tbl2.id2
and tbl2.ymd = cte.ymd)
WHERE
tbl2.ymd = 20151203
{code}
Query1 prunes correctly:
alias: tbl2
Statistics: Num rows: 1 Data size: 7 Basic stats: COMPLETE Column stats: NONE
Query2 does a full table scan on tbl2:
Statistics: Num rows: 4 Data size: 28 Basic stats: COMPLETE Column stats: NONE
I know both queries are not equivalent and the second query is not really a
LEFT JOIN in its current state, but I think the optimiser could do a better job
here.
This example was (re)produced with a cdh 5.5 docker image.
> Partition pruning not happening on left join with CTE
> -----------------------------------------------------
>
> Key: HIVE-12720
> URL: https://issues.apache.org/jira/browse/HIVE-12720
> Project: Hive
> Issue Type: Bug
> Affects Versions: 1.1.0
> Reporter: Clemens Valiente
>
> In certain cases hive is not pruning partitions as well as it could be, and
> instead scanning the whole table.
> Minimal use case:
> {code:sql}
> create table tbl1 (
> id1 int,
> ext_id2 int
> )
> PARTITIONED BY (ymd int);
> create table tbl2 (
> id2 int,
> prop string
> )
> PARTITIONED BY (ymd int);
> INSERT INTO tbl1 PARTITION(ymd=20151201) VALUES (1,1);
> INSERT INTO tbl1 PARTITION(ymd=20151202) VALUES (1,2);
> INSERT INTO tbl1 PARTITION(ymd=20151203) VALUES (1,3);
> INSERT INTO tbl2 PARTITION(ymd=20151201) VALUES (1,'testa');
> INSERT INTO tbl2 PARTITION(ymd=20151202) VALUES (2,'testb');
> INSERT INTO tbl2 PARTITION(ymd=20151203) VALUES (3,'testc');
> -- query 1
> EXPLAIN
> select * from
> tbl1
> LEFT OUTER JOIN
> tbl2
> ON
> (tbl1.ext_id2 = tbl2.id2
> and tbl2.ymd = tbl1.ymd)
> WHERE
> tbl1.ymd = 20151203
> -- query2
> EXPLAIN
> with cte as (select * from tbl1
> WHERE
> tbl1.ymd = 20151203)
> select * from
> cte
> LEFT OUTER JOIN
> tbl2
> ON
> (cte.ext_id2 = tbl2.id2
> and tbl2.ymd = cte.ymd)
> WHERE
> tbl2.ymd = 20151203
> {code}
> Query1 prunes correctly:
> alias: tbl2
> Statistics: Num rows: 1 Data size: 7 Basic stats: COMPLETE Column stats: NONE
> Query2 does a full table scan on tbl2:
> Statistics: Num rows: 4 Data size: 28 Basic stats: COMPLETE Column stats:
> NONE
> I know both queries are not equivalent and the second query is not really a
> LEFT JOIN in its current state, but I think the optimiser could do a better
> job here.
> This example was (re)produced with a cdh 5.5 docker image.
> This issue can be reproduced when:
> set hive.cbo.enable=false;
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)