[Impala-CR](cdh5-trunk) IMPALA-3210: last/first value() support for IGNORE NULLS

Thomas Tauber-Marshall (Code Review) Thu, 07 Jul 2016 12:18:37 -0700

Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-3210: last/first_value() support for IGNORE NULLS
......................................................................



Patch Set 7:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/3328/7/fe/src/main/java/com/cloudera/impala/analysis/AnalyticExpr.java
File fe/src/main/java/com/cloudera/impala/analysis/AnalyticExpr.java:

Line 592:    * 4. UNBOUNDED FOLLOWING windows:
> combine 3 and 4, it looks like the same transformation is applied
Done


Line 661:     // 3.
> see above
Done


http://gerrit.cloudera.org:8080/#/c/3328/7/fe/src/main/java/com/cloudera/impala/catalog/BuiltinsDb.java
File fe/src/main/java/com/cloudera/impala/catalog/BuiltinsDb.java:

Line 491:   private static final Map<Type, String> 
LAST_VALUE_IGNORE_NULLS_INIT_SYMBOL =
> instead of registering a new (wrapper) function, why not simply have the ex
We wanted to avoid adding anything to the last_value implementation that would 
slow down the common case of not-'ignore nulls', such as having to check a 
boolean flag on every call to Add, Remove, and Get on every row of the table, 
and we also wanted to avoid putting any special cases in the backend for this.

I don't mind changing it if you think that's a cleaner design, though.


Line 979:           prefix + UPDATE_VAL_SYMBOL.get(t),
> is update_val_symbol used anywhere else?
No.

UpdateVal is the same function as LastValUpdate. We renamed it because the 
implementation is being reused in the backend, and it makes it clearer what's 
going on there.


http://gerrit.cloudera.org:8080/#/c/3328/7/testdata/workloads/functional-planner/queries/PlannerTest/analytic-fns.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/analytic-fns.test:

Line 1555: |  window: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> why is this one not coalesced w/ the parent analytic node?
In the input query, none of the functions have the same window and ordering. 
Previously, we were able to coalesce two of the first_value calls because we 
were rewriting first_value(unbounded preceding to unbounded following) to 
first_value(unbounded preceding to current row), which are equivalent in the 
non-'ignore nulls' case.

They are not equivalent in the 'ignore nulls' case, though (eg. it may be the 
case that every value up to and including current row is NULL but some 
following value is non-null such that the correct value to return for the 
window (ub preceding to ub following) wouldn't even be in the window (ub 
preceding to current row).

We thought that it was preferable to keep the code simple by doing the same 
standardization in the 'ignore nulls' and non-'ignore nulls' cases, but 
certainly I could add this standardization back in just for the non-'ignore 
nulls' case if we think its important.

However, I actually ran a performance test with this query (modified to run on 
tpch since functional.alltypesagg is small) and the new version is actually 
slightly faster, because it performs 2 sorts instead of 3, and the sorts are 
more expensive than the analytic nodes, so its hard to say what the performance 
effects of these changes will be.

 
+-----------------+-------------------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+
 | Workload        | Query             | File Format           | Avg(s) | Base 
Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Num Clients | Iters |
 
+-----------------+-------------------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+
 | TARGETED-PERF() | PLANNER_TEST_PERF | parquet / none / none | 40.03  | 42.24 
      |   -5.25%   |   0.14%   |   1.40%        | 1           | 3     |
 
+-----------------+-------------------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+


Line 2187: # Start LAST_VALUE IGNORE NULLS
> is there a reason this needs new test cases (or could this be incorporated 
Done


Line 2190:   last_value(tinyint_col ignore nulls) over (order by id rows 
between unbounded preceding and 1 preceding)
> long line (and elsewhere)
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/3328
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic27525e2237fb54318549d2674f1610884208e9b
Gerrit-PatchSet: 7
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Thomas Tauber-Marshall <[email protected]>
Gerrit-Reviewer: Marcel Kornacker <[email protected]>
Gerrit-Reviewer: Matthew Jacobs <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) IMPALA-3210: last/first value() support for IGNORE NULLS

Reply via email to