Alex Behm has posted comments on this change.

Change subject: IMPALA-4731/IMPALA-397/IMPALA-4728: Materialize sort exprs
......................................................................


Patch Set 4:

(17 comments)

http://gerrit.cloudera.org:8080/#/c/6322/2/fe/src/main/java/org/apache/impala/analysis/SortInfo.java
File fe/src/main/java/org/apache/impala/analysis/SortInfo.java:

Line 240:         materializedDesc.setIsMaterialized(true);
> As discussed, I removed the handling of literals.
Ahh right, let me explain just to close on this:

select r from (select rand() r from t) v order by r

Here "r" in the outer query block is a SlotRef into the virtual inline-view 
tuple. So we will just materialize that SlotRef.

On the other hand, it means our cost-based materialization decision may not be 
accurate through inline views (but that's a much less severe issue). For 
example,

select c from (select c1 + c2 c from t) v order by c

we will likewise materialize "c" even though the underlying expr "c1 + c2" is 
cheap according to our costing.

Still sounds worth filing a JIRA for making this decision more accurate (i.e., 
during planning), but I think we can definitely consider all JIRAs in the 
commit msg fixed after this patch :)


http://gerrit.cloudera.org:8080/#/c/6322/4/fe/src/main/java/org/apache/impala/analysis/SortInfo.java
File fe/src/main/java/org/apache/impala/analysis/SortInfo.java:

Line 39:   // All Exprs with cost greater than this will be materialized. Since 
we don't currently
All ordering exprs with cost...


Line 40:   // have any information about actual function costs, this value 
guarantees that all
guarantees -> is intended to ensure


Line 42:   // operations unmaterialized, for example 'SlotRef + SlotRef' would 
have cost 3 and not
remove cost "3" since this might change and it will be hard to tell this 
comment needs to be updated


Line 167:    * output by the sort node. Done by first calling 
createMaterailzedOrderExprs() to
The "Done by" sentence sounds a little procedural/complicated, can you 
condense/simplify it?


Line 170:    * expressions. The materialized exprs are substituted with slot 
refs into the new
exprs


Line 181:     // substOrderBy is a mapping from:
Can be improved for clarity. Also point 2 is not quite accurate.

Here's a proposal:

// Mapping from exprs evaluated on the sort input that get materialized into 
the sort tuple to their corresponding SlotRefs in the sort tuple. The following 
exprs are materialized:
1. Ordering exprs that we chose to materialize
2. SlotRefs against the sort input contained in the result and ordering exprs 
after substituting the materialized ordering exprs


Line 191:     Set<SlotRef> sourceSlots = Sets.newHashSet();
// SlotRefs in the result and ordering exprs after substituting the 
materialized ordering exprs.


Line 201:       if (!substOrderBy.getRhs().contains(origSlotRef)) {
contains() is expensive (and every iteration might grow the rhs).
I suggest checking whether a SlotRef references the sort tuple id.

I know we use contains() elsewhere extensively, but mostly because we have no 
simple and cheap alternative. Here the alternative is pretty simple.


Line 222:    * Materialize ordering exprs that are non-deterministic, contain a 
UDF (since we don't
Easier to read if you have bullet points for the things we materialize, e.g.:
Materialize the following ordering exprs:
- non-deterministic exprs
- exprs that contain a UDF (since ...)


http://gerrit.cloudera.org:8080/#/c/6322/4/testdata/workloads/functional-planner/queries/PlannerTest/sort-expr-materialization.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/sort-expr-materialization.test:

Line 1: #sort on a non-deterministic expr, gets materialized
space after #


Line 118: # expensive order expr materialized in merging exchange
Same as case above? We don't materialize in a merging exchange, the preceding 
sort always does that.


Line 139: # sort on udf, gets materialized
Need additional tests to cover the tricky code to avoid redundant expr 
materializations. Examples for inspiration:

select concat(a, b, c, d) c from t order by c


http://gerrit.cloudera.org:8080/#/c/6322/4/tests/query_test/test_sort.py
File tests/query_test/test_sort.py:

Line 158: class TestRandomSort(ImpalaTestSuite):
This will run over all test vectors right? I think it makes sense to only run 
this over text/none or parquet/none. You can try running this with exhaustive 
and see how often this test is run.


Line 172:     # Include "random()" in the select list to check that its sorted 
correctly.
it's


Line 174:         "select random() as r from functional.alltypessmall order by 
r").data,
also add one with an inline view:

select r from (select random() r functional.alltypessmall) v order by r


Line 185:     """Tests that a window function over 'order by random()' worked 
as expected."""
works as expected


-- 
To view, visit http://gerrit.cloudera.org:8080/6322
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ifefdaff8557a30ac44ea82ed428e6d1ffbca2e9e
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Marcel Kornacker <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes

Reply via email to