Alex Behm has posted comments on this change. Change subject: IMPALA-4731/IMPALA-397/IMPALA-4728: Materialize sort exprs ......................................................................
Patch Set 4: (17 comments) http://gerrit.cloudera.org:8080/#/c/6322/2/fe/src/main/java/org/apache/impala/analysis/SortInfo.java File fe/src/main/java/org/apache/impala/analysis/SortInfo.java: Line 240: materializedDesc.setIsMaterialized(true); > As discussed, I removed the handling of literals. Ahh right, let me explain just to close on this: select r from (select rand() r from t) v order by r Here "r" in the outer query block is a SlotRef into the virtual inline-view tuple. So we will just materialize that SlotRef. On the other hand, it means our cost-based materialization decision may not be accurate through inline views (but that's a much less severe issue). For example, select c from (select c1 + c2 c from t) v order by c we will likewise materialize "c" even though the underlying expr "c1 + c2" is cheap according to our costing. Still sounds worth filing a JIRA for making this decision more accurate (i.e., during planning), but I think we can definitely consider all JIRAs in the commit msg fixed after this patch :) http://gerrit.cloudera.org:8080/#/c/6322/4/fe/src/main/java/org/apache/impala/analysis/SortInfo.java File fe/src/main/java/org/apache/impala/analysis/SortInfo.java: Line 39: // All Exprs with cost greater than this will be materialized. Since we don't currently All ordering exprs with cost... Line 40: // have any information about actual function costs, this value guarantees that all guarantees -> is intended to ensure Line 42: // operations unmaterialized, for example 'SlotRef + SlotRef' would have cost 3 and not remove cost "3" since this might change and it will be hard to tell this comment needs to be updated Line 167: * output by the sort node. Done by first calling createMaterailzedOrderExprs() to The "Done by" sentence sounds a little procedural/complicated, can you condense/simplify it? Line 170: * expressions. The materialized exprs are substituted with slot refs into the new exprs Line 181: // substOrderBy is a mapping from: Can be improved for clarity. Also point 2 is not quite accurate. Here's a proposal: // Mapping from exprs evaluated on the sort input that get materialized into the sort tuple to their corresponding SlotRefs in the sort tuple. The following exprs are materialized: 1. Ordering exprs that we chose to materialize 2. SlotRefs against the sort input contained in the result and ordering exprs after substituting the materialized ordering exprs Line 191: Set<SlotRef> sourceSlots = Sets.newHashSet(); // SlotRefs in the result and ordering exprs after substituting the materialized ordering exprs. Line 201: if (!substOrderBy.getRhs().contains(origSlotRef)) { contains() is expensive (and every iteration might grow the rhs). I suggest checking whether a SlotRef references the sort tuple id. I know we use contains() elsewhere extensively, but mostly because we have no simple and cheap alternative. Here the alternative is pretty simple. Line 222: * Materialize ordering exprs that are non-deterministic, contain a UDF (since we don't Easier to read if you have bullet points for the things we materialize, e.g.: Materialize the following ordering exprs: - non-deterministic exprs - exprs that contain a UDF (since ...) http://gerrit.cloudera.org:8080/#/c/6322/4/testdata/workloads/functional-planner/queries/PlannerTest/sort-expr-materialization.test File testdata/workloads/functional-planner/queries/PlannerTest/sort-expr-materialization.test: Line 1: #sort on a non-deterministic expr, gets materialized space after # Line 118: # expensive order expr materialized in merging exchange Same as case above? We don't materialize in a merging exchange, the preceding sort always does that. Line 139: # sort on udf, gets materialized Need additional tests to cover the tricky code to avoid redundant expr materializations. Examples for inspiration: select concat(a, b, c, d) c from t order by c http://gerrit.cloudera.org:8080/#/c/6322/4/tests/query_test/test_sort.py File tests/query_test/test_sort.py: Line 158: class TestRandomSort(ImpalaTestSuite): This will run over all test vectors right? I think it makes sense to only run this over text/none or parquet/none. You can try running this with exhaustive and see how often this test is run. Line 172: # Include "random()" in the select list to check that its sorted correctly. it's Line 174: "select random() as r from functional.alltypessmall order by r").data, also add one with an inline view: select r from (select random() r functional.alltypessmall) v order by r Line 185: """Tests that a window function over 'order by random()' worked as expected.""" works as expected -- To view, visit http://gerrit.cloudera.org:8080/6322 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ifefdaff8557a30ac44ea82ed428e6d1ffbca2e9e Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-Marshall <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Marcel Kornacker <[email protected]> Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-HasComments: Yes
