Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18862 )

Change subject: IMPALA-11485: Pushdown LIMIT through UNION ALL and LEFT/RIGHT 
OUTER JOIN
......................................................................


Patch Set 7:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/18862/7/testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-outer-join.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-outer-join.test:

http://gerrit.cloudera.org:8080/#/c/18862/7/testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-outer-join.test@2
PS7, Line 2: Base case
Can you add an even simpler test, e.g. with simpler join condition and no where 
clause?


http://gerrit.cloudera.org:8080/#/c/18862/7/testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-outer-join.test@5
PS7, Line 5: testtbl
I think that it would be better to use a non-empty table like in 
functional.alltypes. Empty tables may be optimized out later, leading these 
tests to fail.


http://gerrit.cloudera.org:8080/#/c/18862/7/testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-outer-join.test@8
PS7, Line 8: t1.zip + t2.zip = 10 and t1.zip + t2.zip = 20
This seems to be always false to me - one day the planner may be able to deduce 
this, so I would prefer a condition that can be true.


http://gerrit.cloudera.org:8080/#/c/18862/7/testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-outer-join.test@27
PS7, Line 27:    limit: 5
I am not sure whether it was correct to push down limit in this case - the 
issue is with the predicates in WHERE clause that use columns from both t1 and 
t2.

Let's assume that only a single row is returned from Node 01.
If in node 00 the first 5 rows pass a1.id>0, but will fail t1.zip + t2.zip=10 
with the row from the right side, then the query will return 0 rows. Meanwhile 
it is possible that there was another row in the table that would satisfy all 
predicates.


http://gerrit.cloudera.org:8080/#/c/18862/7/testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-outer-join.test@133
PS7, Line 133: |  |--05:SCAN HDFS [functional.testtbl]
             : |  |     HDFS partitions=1/1 files=0 size=0B
             : |  |     row-size=24B cardinality=0
             : |  |
             : |  04:SCAN HDFS [functional.testtbl]
             : |     HDFS partitions=1/1 files=0 size=0B
             : |     row-size=24B cardinality=0
Why didn't we push down the limit to these scan nodes?


http://gerrit.cloudera.org:8080/#/c/18862/7/testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-union.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-union.test:

http://gerrit.cloudera.org:8080/#/c/18862/7/testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-union.test@2
PS7, Line 2: Base case.
Can you add 1-2 more complex tests? Some ideas:
- multiple UNION ALL in a query
- UNION ALL with a constant subquery like SELECT 1 (a potential optimization 
would be to subtract the constant rows from the LIMIT)
- UNION ALL with both predicates in the outer query and in the inner query


http://gerrit.cloudera.org:8080/#/c/18862/7/testdata/workloads/functional-planner/queries/PlannerTest/topn.test
File testdata/workloads/functional-planner/queries/PlannerTest/topn.test:

http://gerrit.cloudera.org:8080/#/c/18862/7/testdata/workloads/functional-planner/queries/PlannerTest/topn.test@464
PS7, Line 464: # test that top-n is not placed below an unpartitioned exchange 
with a limit
The optimization led to removing the exchange node, so this test no longer 
checks what it should (according to this comment). Maybe increasing the limit 
could lead to keeping the distributed plan?


http://gerrit.cloudera.org:8080/#/c/18862/7/testdata/workloads/functional-query/queries/QueryTest/joins.test
File testdata/workloads/functional-query/queries/QueryTest/joins.test:

http://gerrit.cloudera.org:8080/#/c/18862/7/testdata/workloads/functional-query/queries/QueryTest/joins.test@483
PS7, Line 483: order by a.int_col
Why is this "order by" needed here?


http://gerrit.cloudera.org:8080/#/c/18862/7/tests/query_test/test_observability.py
File tests/query_test/test_observability.py:

http://gerrit.cloudera.org:8080/#/c/18862/7/tests/query_test/test_observability.py@251
PS7, Line 251:     # There are 3 exchange nodes and each appears in the profile 
2 times (for 1 fragment
             :     # instance + the averaged fragment).
             :     assert results.runtime_profile.count("EXCHANGE_NODE") == 8
The comment and the assert no longer match. Also, I am curious why was the 
number of exchange nodes increased.



--
To view, visit http://gerrit.cloudera.org:8080/18862
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia5d040c0a98e60639d7ce4b25ecf07a859c8a32c
Gerrit-Change-Number: 18862
Gerrit-PatchSet: 7
Gerrit-Owner: Baike Xia <[email protected]>
Gerrit-Reviewer: Aman Sinha <[email protected]>
Gerrit-Reviewer: Baike Xia <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Jian Zhang <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Comment-Date: Wed, 19 Oct 2022 15:27:58 +0000
Gerrit-HasComments: Yes

Reply via email to