Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/18862 )
Change subject: IMPALA-11485: Pushdown LIMIT through UNION ALL and LEFT/RIGHT OUTER JOIN ...................................................................... Patch Set 7: (9 comments) http://gerrit.cloudera.org:8080/#/c/18862/7/testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-outer-join.test File testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-outer-join.test: http://gerrit.cloudera.org:8080/#/c/18862/7/testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-outer-join.test@2 PS7, Line 2: Base case Can you add an even simpler test, e.g. with simpler join condition and no where clause? http://gerrit.cloudera.org:8080/#/c/18862/7/testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-outer-join.test@5 PS7, Line 5: testtbl I think that it would be better to use a non-empty table like in functional.alltypes. Empty tables may be optimized out later, leading these tests to fail. http://gerrit.cloudera.org:8080/#/c/18862/7/testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-outer-join.test@8 PS7, Line 8: t1.zip + t2.zip = 10 and t1.zip + t2.zip = 20 This seems to be always false to me - one day the planner may be able to deduce this, so I would prefer a condition that can be true. http://gerrit.cloudera.org:8080/#/c/18862/7/testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-outer-join.test@27 PS7, Line 27: limit: 5 I am not sure whether it was correct to push down limit in this case - the issue is with the predicates in WHERE clause that use columns from both t1 and t2. Let's assume that only a single row is returned from Node 01. If in node 00 the first 5 rows pass a1.id>0, but will fail t1.zip + t2.zip=10 with the row from the right side, then the query will return 0 rows. Meanwhile it is possible that there was another row in the table that would satisfy all predicates. http://gerrit.cloudera.org:8080/#/c/18862/7/testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-outer-join.test@133 PS7, Line 133: | |--05:SCAN HDFS [functional.testtbl] : | | HDFS partitions=1/1 files=0 size=0B : | | row-size=24B cardinality=0 : | | : | 04:SCAN HDFS [functional.testtbl] : | HDFS partitions=1/1 files=0 size=0B : | row-size=24B cardinality=0 Why didn't we push down the limit to these scan nodes? http://gerrit.cloudera.org:8080/#/c/18862/7/testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-union.test File testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-union.test: http://gerrit.cloudera.org:8080/#/c/18862/7/testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-union.test@2 PS7, Line 2: Base case. Can you add 1-2 more complex tests? Some ideas: - multiple UNION ALL in a query - UNION ALL with a constant subquery like SELECT 1 (a potential optimization would be to subtract the constant rows from the LIMIT) - UNION ALL with both predicates in the outer query and in the inner query http://gerrit.cloudera.org:8080/#/c/18862/7/testdata/workloads/functional-planner/queries/PlannerTest/topn.test File testdata/workloads/functional-planner/queries/PlannerTest/topn.test: http://gerrit.cloudera.org:8080/#/c/18862/7/testdata/workloads/functional-planner/queries/PlannerTest/topn.test@464 PS7, Line 464: # test that top-n is not placed below an unpartitioned exchange with a limit The optimization led to removing the exchange node, so this test no longer checks what it should (according to this comment). Maybe increasing the limit could lead to keeping the distributed plan? http://gerrit.cloudera.org:8080/#/c/18862/7/testdata/workloads/functional-query/queries/QueryTest/joins.test File testdata/workloads/functional-query/queries/QueryTest/joins.test: http://gerrit.cloudera.org:8080/#/c/18862/7/testdata/workloads/functional-query/queries/QueryTest/joins.test@483 PS7, Line 483: order by a.int_col Why is this "order by" needed here? http://gerrit.cloudera.org:8080/#/c/18862/7/tests/query_test/test_observability.py File tests/query_test/test_observability.py: http://gerrit.cloudera.org:8080/#/c/18862/7/tests/query_test/test_observability.py@251 PS7, Line 251: # There are 3 exchange nodes and each appears in the profile 2 times (for 1 fragment : # instance + the averaged fragment). : assert results.runtime_profile.count("EXCHANGE_NODE") == 8 The comment and the assert no longer match. Also, I am curious why was the number of exchange nodes increased. -- To view, visit http://gerrit.cloudera.org:8080/18862 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia5d040c0a98e60639d7ce4b25ecf07a859c8a32c Gerrit-Change-Number: 18862 Gerrit-PatchSet: 7 Gerrit-Owner: Baike Xia <[email protected]> Gerrit-Reviewer: Aman Sinha <[email protected]> Gerrit-Reviewer: Baike Xia <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Jian Zhang <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Comment-Date: Wed, 19 Oct 2022 15:27:58 +0000 Gerrit-HasComments: Yes
