Dimitris Tsirogiannis has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9777 )

Change subject: IMPALA-4025: Part 1: Add percentile_disc aggregation function
......................................................................


Patch Set 1:

(17 comments)

http://gerrit.cloudera.org:8080/#/c/9777/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9777/1//COMMIT_MSG@7
PS1, Line 7: Part 1
What's in Part 2?


http://gerrit.cloudera.org:8080/#/c/9777/1//COMMIT_MSG@24
PS1, Line 24: impala returns NULL
Why is that?


http://gerrit.cloudera.org:8080/#/c/9777/1/fe/src/main/java/org/apache/impala/analysis/PercentileAggExpr.java
File fe/src/main/java/org/apache/impala/analysis/PercentileAggExpr.java:

http://gerrit.cloudera.org:8080/#/c/9777/1/fe/src/main/java/org/apache/impala/analysis/PercentileAggExpr.java@49
PS1, Line 49: Rewrite 'stmt' into a child query and a parent query
"Rewrite 'stmt' containing percentile functions in an equivalent statement 
using subqueries. Percentile functions may exist in: a) the select list, b) the 
HAVING clause, and c) the ORDER BY clause.". Also, do we assume that 'stmt' 
contains at least one percentile fn or it doesn't matter? What happens if there 
is no percentile fn? Maybe mention that as well.


http://gerrit.cloudera.org:8080/#/c/9777/1/fe/src/main/java/org/apache/impala/analysis/PercentileAggExpr.java@52
PS1, Line 52: "select percentile_disc(0.5) within group (order by col) from tbl"
            :    * would be rewritten to "select min(case when 
_percentile_row_number_diff_0 >= 0 then
            :    * _percentile_view.col end) from (select row_number() over 
(order by col) -
            :    * 0.5 * count(col) over() _percentile_row_number_diff_0 from 
tbl) _percentile_view".
I think a more elaborate example may help here. For instance: "select 
percentile_disc(0.5) within group (order by col), b from tbl where c < 10 group 
by b having percentile_disc(0.1) within group (order by col) > 10;" That 
example would highlight a few more things such as as the generation of multiple 
child query select items, how the WHERE clause of the original stmt is 
transferred to the subquery, and how the having clause is rewritten as well.


http://gerrit.cloudera.org:8080/#/c/9777/1/fe/src/main/java/org/apache/impala/analysis/PercentileAggExpr.java@56
PS1, Line 56: 'stmt' needs to be reanalyzed after the rewrite.
Is the assumption that 'stmt' is analyzed before calling this function? If so, 
mention that in the comment and add a precondition check at the beginning of 
the function.


http://gerrit.cloudera.org:8080/#/c/9777/1/fe/src/main/java/org/apache/impala/analysis/PercentileAggExpr.java@58
PS1, Line 58: static void rewriteSelectStmt(SelectStmt stmt) {
Shouldn't that be in StmtRewriter.java?


http://gerrit.cloudera.org:8080/#/c/9777/1/fe/src/main/java/org/apache/impala/analysis/PercentileAggExpr.java@59
PS1, Line 59: new ArrayList<>();
nit: just for consistency's sake, can you plz use Guava's newArrayList() static 
function?


http://gerrit.cloudera.org:8080/#/c/9777/1/fe/src/main/java/org/apache/impala/analysis/PercentileAggExpr.java@59
PS1, Line 59: ArrayList<SelectListItem> childSelectListItems = new 
ArrayList<>();
Comment what is stored here.


http://gerrit.cloudera.org:8080/#/c/9777/1/fe/src/main/java/org/apache/impala/analysis/PercentileAggExpr.java@63
PS1, Line 63: reWriteExpr(item.getExpr(), stmt.groupingExprs_, 
childSelectListItems));
Just by looking at this piece of code, it's not clear the extend to which exprs 
are rewritten, i.e. do you handle only the percentile case or other expr 
rewrites as well? Maybe rename the function to something more explicit like 
rewritePercentileExpr or something along these lines.


http://gerrit.cloudera.org:8080/#/c/9777/1/fe/src/main/java/org/apache/impala/analysis/PercentileAggExpr.java@84
PS1, Line 84: (TableSampleClause)
nit: do you need that cast?


http://gerrit.cloudera.org:8080/#/c/9777/1/fe/src/main/java/org/apache/impala/analysis/PercentileAggExpr.java@90
PS1, Line 90: The generated child query select
            :    * list items will be added to 'childSelectListItems' and the 
rewritten expr will be
            :    * returned.
This is a bit too dense and not easy to follow. I think it may be better to 
describe this as follows. From each percentile expr 'e', two new exprs are 
generated (based on the transformation described in rewriteSelectStmt() 
function), one that replaces 'e' in the select list of the original statement 
and one that is added to the select list of the generated subquery.


http://gerrit.cloudera.org:8080/#/c/9777/1/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
File fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java:

http://gerrit.cloudera.org:8080/#/c/9777/1/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@1001
PS1, Line 1001: does
"do"


http://gerrit.cloudera.org:8080/#/c/9777/1/fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
File fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java:

http://gerrit.cloudera.org:8080/#/c/9777/1/fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java@2329
PS1, Line 2329:
There are many more interesting test cases such as: a) percentiles in select 
list, having and order by, b) percentiles in subqueries, etc. Also, numerical 
exprs or functions in the percentile_disc (e.g. percentile_disc(0.5+0.1), etc).


http://gerrit.cloudera.org:8080/#/c/9777/1/testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test
File testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test:

http://gerrit.cloudera.org:8080/#/c/9777/1/testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test@1455
PS1, Line 1455: ====
Percentile functions may show up in both having and order by clauses. Add a few 
tests cases to cover these as well.


http://gerrit.cloudera.org:8080/#/c/9777/1/testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite.test:

http://gerrit.cloudera.org:8080/#/c/9777/1/testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite.test@2352
PS1, Line 2352: ====
Also add a case where a percentile shows up in the subquery (scalar and 
non-scalar subqueries)


http://gerrit.cloudera.org:8080/#/c/9777/1/testdata/workloads/functional-query/queries/QueryTest/aggregation.test
File testdata/workloads/functional-query/queries/QueryTest/aggregation.test:

http://gerrit.cloudera.org:8080/#/c/9777/1/testdata/workloads/functional-query/queries/QueryTest/aggregation.test@1377
PS1, Line 1377: select percentile_disc(-1) within group (order by tinyint_col) 
from alltypesagg;
              : ---- RESULTS
              : NULL
              : ---- TYPES
              : TINYINT
              : ====
              : ---- QUERY
              : select percentile_disc(1.1) within group (order by tinyint_col) 
from alltypesagg;
              : ---- RESULTS
              : NULL
              : ---- TYPES
Just write one query that covers these cases "percentile_disc(1.1)..., 
percentile_disc(0.5)..., percentile_disc(-1)...."


http://gerrit.cloudera.org:8080/#/c/9777/1/testdata/workloads/functional-query/queries/QueryTest/aggregation.test@1414
PS1, Line 1414: select percentile_disc(0) within group (order by tinyint_col) 
from alltypesagg;
              : ---- RESULTS
              : 1
              : ---- TYPES
              : TINYINT
              : ====
              : ---- QUERY
              : select percentile_disc(1) within group (order by tinyint_col) 
from alltypesagg;
              : ---- RESULTS
              : 9
              : ---- TYPES
              : TINYINT
              : ====
              : ---- QUERY
              : # Test decreasing percentile
              : select percentile_disc(0) within group (order by tinyint_col 
desc) from alltypesagg;
              : ---- RESULTS
              : 9
              : ---- TYPES
              : TINYINT
              : ====
              : ---- QUERY
              : select percentile_disc(1) within group (order by tinyint_col 
desc) from alltypesagg;
              : ---- RESULTS
              : 1
              : ---- TYPES
              : TINYINT
Same comment as before. I'd just replace these with one query and multiple 
select list items.



--
To view, visit http://gerrit.cloudera.org:8080/9777
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacef7b3fcd74c4c73d88400ce27307c3baa0121e
Gerrit-Change-Number: 9777
Gerrit-PatchSet: 1
Gerrit-Owner: Tianyi Wang <[email protected]>
Gerrit-Reviewer: Dimitris Tsirogiannis <[email protected]>
Gerrit-Comment-Date: Tue, 27 Mar 2018 22:11:37 +0000
Gerrit-HasComments: Yes

Reply via email to