[
https://issues.apache.org/jira/browse/IMPALA-13262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875693#comment-17875693
]
ASF subversion and git services commented on IMPALA-13262:
----------------------------------------------------------
Commit 07d44b7affa0652f14d8044cd7ffa604f250b77b in impala's branch
refs/heads/master from Fang-Yu Rao
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=07d44b7af ]
IMPALA-13262: Do not always migrate inferred predicates into inline view
This patch removes a predicate inferred from a set of analytic
predicates if both sides of the inferred predicate reference the same
TupleId when migrating those analytic predicates into an inline view.
This is to prevent Impala from pushing the inferred conjunct to the
scan node before the analytic functions are applied, which could produce
an incorrect result.
Testing:
- Added additional query and planner test cases to verify Impala's
behavior after this patch.
- Verified the patch passed the core tests.
Change-Id: I6e2632b3b1a140ae0104ceba4e2f474ac1bbcda1
Reviewed-on: http://gerrit.cloudera.org:8080/21688
Reviewed-by: Michael Smith <[email protected]>
Reviewed-by: Riza Suminto <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Predicate pushdown causes incorrect results in join condition
> -------------------------------------------------------------
>
> Key: IMPALA-13262
> URL: https://issues.apache.org/jira/browse/IMPALA-13262
> Project: IMPALA
> Issue Type: Bug
> Reporter: Fang-Yu Rao
> Assignee: Fang-Yu Rao
> Priority: Major
> Labels: correctness
>
> We found that in some scenario Apache Impala
> ([https://github.com/apache/impala/commit/c539874]) could incorrectly push
> predicates to scan nodes, which in turn produces the wrong result. The
> following is a concrete example to reproduce the issue.
> {code:sql}
> create database impala_13262;
> use impala_13262;
> create table department ( dept_no integer, dept_rank integer, start_date
> timestamp,end_date timestamp);
> insert into department values(1,1,'2024-01-01','2024-01-02');
> insert into department values(1,2,'2024-01-02','2024-01-03');
> insert into department values(1,3,'2024-01-03','2024-01-03');
> create table employee (employee_no integer, depart_no integer);
> insert into employee values (1,1);
> // The following query should return 0 row. However Apache Impala produces
> one row.
> select * from employee t1
> inner join (
> select * from
> (
> select dept_no,dept_rank,start_date,end_date
> ,row_number() over(partition by dept_no order by dept_rank) rn
> from department
> ) t2
> where rn=1
> ) t2
> on t1.depart_no=t2.dept_no
> where t2.start_date=t2.end_date;
> set explain_level=2;
> // In the output of the EXPLAIN statement, we found that the predicate
> "start_data = end_date" was pushed
> // down to the scan node, which is wrong.
> | 01:SCAN HDFS [impala_13262.department, RANDOM]
> |
> | HDFS partitions=1/1 files=3 size=132B
> |
> | predicates: start_date = end_date
> |
> | stored statistics:
> |
> | table: rows=unavailable size=unavailable
> |
> | columns: unavailable
> |
> | extrapolated-rows=disabled max-scan-range-rows=unavailable
> |
> | mem-estimate=32.00MB mem-reservation=8.00KB thread-reservation=1
> |
> | tuple-ids=1 row-size=40B cardinality=1
> |
> | in pipelines: 01(GETNEXT)
> |
> +-------------------------------------------------------------------------------------------------------+
> {code}
>
> +*Edit:*+
> The following is a smaller case to reproduce the issue. The correct result
> should be 0 row but Impala returns 1 row as above.
> {code:java}
> select * from
> (
> select dept_no,dept_rank,start_date,end_date
> ,row_number() over(partition by dept_no order by dept_rank) rn
> from department
> ) t2
> where rn=1 and t2.start_date=t2.end_date;
> {code}
>
> Recall the contents of the inline view '{*}t2{*}' above is as follows.
> {code:java}
> +---------+-----------+---------------------+---------------------+----+
> | dept_no | dept_rank | start_date | end_date | rn |
> +---------+-----------+---------------------+---------------------+----+
> | 1 | 1 | 2024-01-01 00:00:00 | 2024-01-02 00:00:00 | 1 |
> | 1 | 2 | 2024-01-02 00:00:00 | 2024-01-03 00:00:00 | 2 |
> | 1 | 3 | 2024-01-03 00:00:00 | 2024-01-03 00:00:00 | 3 |
> +---------+-----------+---------------------+---------------------+----+
> {code}
>
> On the other hand, the following query without the conjunct '{*}rn=1{*}'
> returns the correct result, which is the row with '{*}rn{*}' equal to *3*
> above. It almost looks like adding this '{*}rn=1{*}' predicate triggers the
> incorrect pushdown of '{*}t2.start_date=t2.end_date{*}' to the scan node of
> the table '{*}department{*}'.
> {code:java}
> select * from
> (
> select dept_no,dept_rank,start_date,end_date
> ,row_number() over(partition by dept_no order by dept_rank) rn
> from department
> ) t2
> where t2.start_date=t2.end_date;
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]