Thomas Tauber-Marshall has posted comments on this change. Change subject: IMPALA-4731/IMPALA-397/IMPALA-4728: Materialize sort exprs ......................................................................
Patch Set 2: (24 comments) http://gerrit.cloudera.org:8080/#/c/6322/1/fe/src/main/java/org/apache/impala/analysis/Expr.java File fe/src/main/java/org/apache/impala/analysis/Expr.java: Line 212: // True if this expr always returns the same value given the same input, false if it > True if this exprs always returns the same value given the same input. Fals Done Line 213: // does not, or if the behavior is unknown (e.g. UDFs). Set at the end of analyze() and > e.g. Done http://gerrit.cloudera.org:8080/#/c/6322/1/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java: Line 292: // We currently don't have a way to indicate if a UDF is deterministic, so just always > Ideally we should not conflate the UDF and deterministic concepts. For exam Not sure what you mean by 'two separate members". Maybe Expr.containsNonDeterministicBuiltin and Expr.containsUDF? Also, it doesn't seem like FunctionCallExpr.isNondeterministicBuiltinFn() and Expr.isDeterministic() could ever be guaranteed to return opposite values, since a deterministic function that has non-deterministic parameters would constitute a non-deterministic expr. http://gerrit.cloudera.org:8080/#/c/6322/1/fe/src/main/java/org/apache/impala/analysis/QueryStmt.java File fe/src/main/java/org/apache/impala/analysis/QueryStmt.java: Line 253: if (!(smap.getLhs().get(i) instanceof SlotRef) > use {} for multi-line if Done http://gerrit.cloudera.org:8080/#/c/6322/1/fe/src/main/java/org/apache/impala/analysis/SortInfo.java File fe/src/main/java/org/apache/impala/analysis/SortInfo.java: Line 39: // All Exprs with cost greater than this will be materialized. Since we don't currently > Remove TODO, instead comment where this value came from, maybe with some ex Done Line 46: > needs a brief comment, in particular why we need to store them Done Line 172: */ > We should collect the SlotRefs only for non-materialized ordering exprs. Done Line 180: // substOrderBy is a mapping from slot refs in the sort node's input and ordering > update comment to reflect the new contents of the substOrderBy smap Done Line 196: SlotDescriptor origSlotDesc = origSlotRef.getDesc(); > update comment, we are not only substituting slot refs Done Line 208: substituteOrderingExprs(substOrderBy, analyzer); > Say something about the args of this function. Also mention side-effects (p Done Line 210: // Update the tuple descriptor used to materialize the input of the sort. > make this the first sentence in this comment Done Line 211: setMaterializedTupleInfo(sortTupleDesc, sortTupleExprs); > no more hint, right? Done Line 214: } > from the materialized sort exprs to the new SlotRefs Done Line 216: /** > Instead of returning a new smap and combining with the other one, why not p I updated it to take the smap as a parameter, but the problem with returning the materialized exprs to avoid side effects is that we also need to update isAsc and nullsFirst. Line 217: * Materialize ordering exprs that are non-deterministic, are more expensive than a cost > analyzer needed? For 'addSlotDescriptor' Line 219: * 'sortTupleDesc' as their parent. > remove Done Line 228: * exprs that get materialized. > Why TODO? What happens when there are no sort exprs at all left? There's already a lot in this review, and I wasn't sure how easy this would be. I'm happy to take a look if you want. If there are no sort exprs, we still go through the whole sort but every row is considered equal (by TupleRowComparator::Compare"). This is the same behavior as would currently happen, except slightly faster as the literal expressions don't get repeatedly evaluated and compared. I added a test to at least show it works as expected. http://gerrit.cloudera.org:8080/#/c/6322/1/fe/src/test/java/org/apache/impala/planner/PlannerTest.java File fe/src/test/java/org/apache/impala/planner/PlannerTest.java: Line 374: public void testSortExprMaterialization() { > testSortExprMaterialization Done http://gerrit.cloudera.org:8080/#/c/6322/1/testdata/workloads/functional-planner/queries/PlannerTest/sort-materialization.test File testdata/workloads/functional-planner/queries/PlannerTest/sort-materialization.test: Line 19 > I think these tests need more examples cheap and expensive exprs to make su I'm not sure how to make that work with PlannerTest, but I added a test to query_tests/test_udfs that works. For UDAs, assuming that you mean something like: select bool_col, sum(min(int_col)) over(order by uda(int_col)) from functional.alltypes group by 1; I believe that will already be materialized, in a SlotRef created in AggregateInfoBase.createTupleDesc. Line 37 > Can you ask Greg about some specific expensive ordering exprs that have com Done Line 43 > i think it's questionable whether something like this should be materialize Its very difficult to say what a good threshold would be since our cost model is pretty bad - partially my fault, since I wrote it in the first place, but also because we don't have any information about the actual cost of functions. Given the discussion on the design doc that not materializing an expensive expr is probably a bigger problem than materializing a cheap expr, I'm inclined to set it relatively low, especially if we're not going to provide an override to the user. For example, the function 'regex_replace' should definitely be materialized but depending on its parameters it can be costed as low as 13 (10 for the function call + 3 parameters). Unfortunately that's also the lowest possible cost that 'if' could have (its actually more expensive here due to the binary predicate). So, maybe we set the threshold as FUNCTION_CALL_COST (10) so that way all expensive functions are guaranteed to be materialized but things like simple arithmetic operations are left non-materialized (for example slotref + slotref + slotref would be costed as 5). And of course we should consider improving the cost model, though not in this patch. Line 106 > It doesn't seem right to print this here because the merging exchange does Done http://gerrit.cloudera.org:8080/#/c/6322/1/tests/query_test/test_sort.py File tests/query_test/test_sort.py: Line 165: # "order by random()" with different seeds should produce different orderings. > Confusing sentence, what is 'unordered'? Done Line 168: results_seed1 = self.execute_query(seed_query % "1") > remove Done -- To view, visit http://gerrit.cloudera.org:8080/6322 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ifefdaff8557a30ac44ea82ed428e6d1ffbca2e9e Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-Marshall <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Marcel Kornacker <[email protected]> Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-HasComments: Yes
