Tim Armstrong has uploaded a new patch set (#3). ( http://gerrit.cloudera.org:8080/16942 )
Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present
......................................................................

IMPALA-10296: Fix analytic limit pushdown when predicates are present

This fixes the analytic limit pushdown optimization for the case where
the ORDER BY expressions are compatible with the partitioning of the
analytic *and* there is a rank() or row_number() predicate. In this
case the rows returned are going to come from the first partitions:
if the limit is 100 and we go through the partitions in order until
the row count adds up to 100, then we know that the rows must come
from those partitions.

The problem is that predicates can discard rows from the partitions,
meaning that a limit naively pushed down to the top-n will filter out
rows that could be returned from the query.

We can avoid the problem in the case where the partition limit is >=
the query's limit, however. In this case the relevant set of
partitions is the set of partitions that includes the first <limit>
rows, since the top-level limit generally kicks in before the
per-partition limit. The only twist is that the orderings may be
different within a partition, so we need to make sure to include all
of the rows in the final partition.

The solution implemented in this patch is to increase the pushed-down
limit so that it is always guaranteed to include all of the rows in
the final partition to be returned. E.g. if you had a
row_number() <= 100 predicate and LIMIT 100, then pushing down limit
200 would guarantee capturing all of the rows in the final partition.

One case we need to handle is that, with a rank() predicate, a
partition can contain more than that number of rows because of ties.
This patch implements tie handling in the backend (I took most of
that implementation from my in-progress partitioned top-n patch, with
the intention of rebasing that onto this patch).
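The two ideas above can be sketched as follows. This is a hypothetical Python illustration, not the patch's actual Java/C++ code: `pushed_down_limit` and `top_n_with_ties` are invented names, and the real top-n operator works on tuples and comparators rather than plain lists.

```python
def pushed_down_limit(query_limit, partition_limit):
    """Safe limit to push to the top-n when a rank()/row_number()
    predicate caps each partition at partition_limit rows and
    partition_limit >= query_limit.

    The first query_limit rows span some set of partitions; the last
    of those partitions can contribute at most partition_limit rows
    (ignoring ties), so query_limit + partition_limit always covers
    the whole final partition.
    """
    return query_limit + partition_limit


def top_n_with_ties(rows, n, key):
    """Tie-inclusive top-n, as needed for rank() predicates: also keep
    any row whose sort key equals the key of the n-th row, so ties at
    the boundary are not dropped."""
    ordered = sorted(rows, key=key)
    if len(ordered) <= n:
        return ordered
    boundary = key(ordered[n - 1])
    out = ordered[:n]
    for row in ordered[n:]:
        if key(row) != boundary:
            break  # past the tie group at the boundary
        out.append(row)
    return out
```

For example, with the row_number() <= 100 predicate and LIMIT 100 from the commit message, `pushed_down_limit(100, 100)` yields 200.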
TODO: apply top-n threshold to avoid increasing limit enormously

Testing:
* Add new planner test with negative case where it's rejected
* Update other planner tests to reflect new limits + tie handling
* Add some end-to-end tests that repro bugs

TODO:
* Add planner tests for edge cases with limits - =, <, etc
* Add planner test for row_number()
* Add planner test for very high rank predicate that overflows int
* Add some end-to-end tests that stress tie-handling more
* test perf - q67

Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
---
M be/src/exec/topn-node-ir.cc
M be/src/exec/topn-node.cc
M be/src/exec/topn-node.h
M be/src/util/tuple-row-compare.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java
M fe/src/main/java/org/apache/impala/analysis/AnalyticWindow.java
M fe/src/main/java/org/apache/impala/analysis/SlotRef.java
M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java
M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test
A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test
M tests/query_test/test_limit_pushdown_analytic.py
17 files changed, 798 insertions(+), 335 deletions(-)

git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/42/16942/3
--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 3
Gerrit-Owner: Tim Armstrong <[email protected]>
