IMPALA-4728: Materialize sort exprs

Thomas Tauber-Marshall (Code Review) Wed, 15 Mar 2017 07:45:21 -0700

Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-4731/IMPALA-397/IMPALA-4728: Materialize sort exprs
......................................................................



Patch Set 2:

(24 comments)

http://gerrit.cloudera.org:8080/#/c/6322/1/fe/src/main/java/org/apache/impala/analysis/Expr.java
File fe/src/main/java/org/apache/impala/analysis/Expr.java:

Line 212:   // True if this expr always returns the same value given the same 
input, false if it
> True if this exprs always returns the same value given the same input. Fals
Done


Line 213:   // does not, or if the behavior is unknown (e.g. UDFs). Set at the 
end of analyze() and
> e.g.
Done


http://gerrit.cloudera.org:8080/#/c/6322/1/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java:

Line 292:     // We currently don't have a way to indicate if a UDF is 
deterministic, so just always
> Ideally we should not conflate the UDF and deterministic concepts. For exam
Not sure what you mean by 'two separate members". Maybe 
Expr.containsNonDeterministicBuiltin and Expr.containsUDF?

Also, it doesn't seem like FunctionCallExpr.isNondeterministicBuiltinFn() and 
Expr.isDeterministic() could ever be guaranteed to return opposite values, 
since a deterministic function that has non-deterministic parameters would 
constitute a non-deterministic expr.


http://gerrit.cloudera.org:8080/#/c/6322/1/fe/src/main/java/org/apache/impala/analysis/QueryStmt.java
File fe/src/main/java/org/apache/impala/analysis/QueryStmt.java:

Line 253:       if (!(smap.getLhs().get(i) instanceof SlotRef)
> use {} for multi-line if
Done


http://gerrit.cloudera.org:8080/#/c/6322/1/fe/src/main/java/org/apache/impala/analysis/SortInfo.java
File fe/src/main/java/org/apache/impala/analysis/SortInfo.java:

Line 39:   // All Exprs with cost greater than this will be materialized. Since 
we don't currently
> Remove TODO, instead comment where this value came from, maybe with some ex
Done


Line 46: 
> needs a brief comment, in particular why we need to store them
Done


Line 172:    */
> We should collect the SlotRefs only for non-materialized ordering exprs.
Done


Line 180:     // substOrderBy is a mapping from slot refs in the sort node's 
input and ordering
> update comment to reflect the new contents of the substOrderBy smap
Done


Line 196:       SlotDescriptor origSlotDesc = origSlotRef.getDesc();
> update comment, we are not only substituting slot refs
Done


Line 208:     substituteOrderingExprs(substOrderBy, analyzer);
> Say something about the args of this function. Also mention side-effects (p
Done


Line 210:     // Update the tuple descriptor used to materialize the input of 
the sort.
> make this the first sentence in this comment
Done


Line 211:     setMaterializedTupleInfo(sortTupleDesc, sortTupleExprs);
> no more hint, right?
Done


Line 214:   }
> from the materialized sort exprs to the new SlotRefs
Done


Line 216:   /**
> Instead of returning a new smap and combining with the other one, why not p
I updated it to take the smap as a parameter, but the problem with returning 
the materialized exprs to avoid side effects is that we also need to update 
isAsc and nullsFirst.


Line 217:    * Materialize ordering exprs that are non-deterministic, are more 
expensive than a cost
> analyzer needed?
For 'addSlotDescriptor'


Line 219:    * 'sortTupleDesc' as their parent.
> remove
Done


Line 228:    * exprs that get materialized.
> Why TODO? What happens when there are no sort exprs at all left?
There's already a lot in this review, and I wasn't sure how easy this would be. 
I'm happy to take a look if you want.

If there are no sort exprs, we still go through the whole sort but every row is 
considered equal (by TupleRowComparator::Compare"). This is the same behavior 
as would currently happen, except slightly faster as the literal expressions 
don't get repeatedly evaluated and compared.

I added a test to at least show it works as expected.


http://gerrit.cloudera.org:8080/#/c/6322/1/fe/src/test/java/org/apache/impala/planner/PlannerTest.java
File fe/src/test/java/org/apache/impala/planner/PlannerTest.java:

Line 374:   public void testSortExprMaterialization() {
> testSortExprMaterialization
Done


http://gerrit.cloudera.org:8080/#/c/6322/1/testdata/workloads/functional-planner/queries/PlannerTest/sort-materialization.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/sort-materialization.test:

Line 19
> I think these tests need more examples cheap and expensive exprs to make su
I'm not sure how to make that work with PlannerTest, but I added a test to 
query_tests/test_udfs that works.

For UDAs, assuming that you mean something like:
select bool_col, sum(min(int_col)) over(order by uda(int_col)) from 
functional.alltypes group by 1;
I believe that will already be materialized, in a SlotRef created in 
AggregateInfoBase.createTupleDesc.


Line 37
> Can you ask Greg about some specific expensive ordering exprs that have com
Done


Line 43
> i think it's questionable whether something like this should be materialize
Its very difficult to say what a good threshold would be since our cost model 
is pretty bad - partially my fault, since I wrote it in the first place, but 
also because we don't have any information about the actual cost of functions.

Given the discussion on the design doc that not materializing an expensive expr 
is probably a bigger problem than materializing a cheap expr, I'm inclined to 
set it relatively low, especially if we're not going to provide an override to 
the user.

For example, the function 'regex_replace' should definitely be materialized but 
depending on its parameters it can be costed as low as 13 (10 for the function 
call + 3 parameters). Unfortunately that's also the lowest possible cost that 
'if' could have (its actually more expensive here due to the binary predicate).

So, maybe we set the threshold as FUNCTION_CALL_COST (10) so that way all 
expensive functions are guaranteed to be materialized but things like simple 
arithmetic operations are left non-materialized (for example slotref + slotref 
+ slotref would be costed as 5).

And of course we should consider improving the cost model, though not in this 
patch.


Line 106
> It doesn't seem right to print this here because the merging exchange does 
Done


http://gerrit.cloudera.org:8080/#/c/6322/1/tests/query_test/test_sort.py
File tests/query_test/test_sort.py:

Line 165:     # "order by random()" with different seeds should produce 
different orderings.
> Confusing sentence, what is 'unordered'?
Done


Line 168:     results_seed1 = self.execute_query(seed_query % "1")
> remove
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6322
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ifefdaff8557a30ac44ea82ed428e6d1ffbca2e9e
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Marcel Kornacker <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4731/IMPALA-397/IMPALA-4728: Materialize sort exprs

Reply via email to