Dimitris Tsirogiannis has posted comments on this change. ( http://gerrit.cloudera.org:8080/9777 )
Change subject: IMPALA-4025: Part 1: Add percentile_disc aggregation function ...................................................................... Patch Set 1: (17 comments) http://gerrit.cloudera.org:8080/#/c/9777/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/9777/1//COMMIT_MSG@7 PS1, Line 7: Part 1 What's in Part 2? http://gerrit.cloudera.org:8080/#/c/9777/1//COMMIT_MSG@24 PS1, Line 24: impala returns NULL Why is that? http://gerrit.cloudera.org:8080/#/c/9777/1/fe/src/main/java/org/apache/impala/analysis/PercentileAggExpr.java File fe/src/main/java/org/apache/impala/analysis/PercentileAggExpr.java: http://gerrit.cloudera.org:8080/#/c/9777/1/fe/src/main/java/org/apache/impala/analysis/PercentileAggExpr.java@49 PS1, Line 49: Rewrite 'stmt' into a child query and a parent query "Rewrite 'stmt' containing percentile functions in an equivalent statement using subqueries. Percentile functions may exist in: a) the select list, b) the HAVING clause, and c) the ORDER BY clause.". Also, do we assume that 'stmt' contains at least one percentile fn or it doesn't matter? What happens if there is no percentile fn? Maybe mention that as well. http://gerrit.cloudera.org:8080/#/c/9777/1/fe/src/main/java/org/apache/impala/analysis/PercentileAggExpr.java@52 PS1, Line 52: "select percentile_disc(0.5) within group (order by col) from tbl" : * would be rewritten to "select min(case when _percentile_row_number_diff_0 >= 0 then : * _percentile_view.col end) from (select row_number() over (order by col) - : * 0.5 * count(col) over() _percentile_row_number_diff_0 from tbl) _percentile_view". I think a more elaborate example may help here. For instance: "select percentile_disc(0.5) within group (order by col), b from tbl where c < 10 group by b having percentile_disc(0.1) within group (order by col) > 10;" That example would highlight a few more things such as as the generation of multiple child query select items, how the WHERE clause of the original stmt is transferred to the subquery, and how the having clause is rewritten as well. http://gerrit.cloudera.org:8080/#/c/9777/1/fe/src/main/java/org/apache/impala/analysis/PercentileAggExpr.java@56 PS1, Line 56: 'stmt' needs to be reanalyzed after the rewrite. Is the assumption that 'stmt' is analyzed before calling this function? If so, mention that in the comment and add a precondition check at the beginning of the function. http://gerrit.cloudera.org:8080/#/c/9777/1/fe/src/main/java/org/apache/impala/analysis/PercentileAggExpr.java@58 PS1, Line 58: static void rewriteSelectStmt(SelectStmt stmt) { Shouldn't that be in StmtRewriter.java? http://gerrit.cloudera.org:8080/#/c/9777/1/fe/src/main/java/org/apache/impala/analysis/PercentileAggExpr.java@59 PS1, Line 59: new ArrayList<>(); nit: just for consistency's sake, can you plz use Guava's newArrayList() static function? http://gerrit.cloudera.org:8080/#/c/9777/1/fe/src/main/java/org/apache/impala/analysis/PercentileAggExpr.java@59 PS1, Line 59: ArrayList<SelectListItem> childSelectListItems = new ArrayList<>(); Comment what is stored here. http://gerrit.cloudera.org:8080/#/c/9777/1/fe/src/main/java/org/apache/impala/analysis/PercentileAggExpr.java@63 PS1, Line 63: reWriteExpr(item.getExpr(), stmt.groupingExprs_, childSelectListItems)); Just by looking at this piece of code, it's not clear the extend to which exprs are rewritten, i.e. do you handle only the percentile case or other expr rewrites as well? Maybe rename the function to something more explicit like rewritePercentileExpr or something along these lines. http://gerrit.cloudera.org:8080/#/c/9777/1/fe/src/main/java/org/apache/impala/analysis/PercentileAggExpr.java@84 PS1, Line 84: (TableSampleClause) nit: do you need that cast? http://gerrit.cloudera.org:8080/#/c/9777/1/fe/src/main/java/org/apache/impala/analysis/PercentileAggExpr.java@90 PS1, Line 90: The generated child query select : * list items will be added to 'childSelectListItems' and the rewritten expr will be : * returned. This is a bit too dense and not easy to follow. I think it may be better to describe this as follows. From each percentile expr 'e', two new exprs are generated (based on the transformation described in rewriteSelectStmt() function), one that replaces 'e' in the select list of the original statement and one that is added to the select list of the generated subquery. http://gerrit.cloudera.org:8080/#/c/9777/1/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java File fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java: http://gerrit.cloudera.org:8080/#/c/9777/1/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@1001 PS1, Line 1001: does "do" http://gerrit.cloudera.org:8080/#/c/9777/1/fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java File fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java: http://gerrit.cloudera.org:8080/#/c/9777/1/fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java@2329 PS1, Line 2329: There are many more interesting test cases such as: a) percentiles in select list, having and order by, b) percentiles in subqueries, etc. Also, numerical exprs or functions in the percentile_disc (e.g. percentile_disc(0.5+0.1), etc). http://gerrit.cloudera.org:8080/#/c/9777/1/testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test File testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test: http://gerrit.cloudera.org:8080/#/c/9777/1/testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test@1455 PS1, Line 1455: ==== Percentile functions may show up in both having and order by clauses. Add a few tests cases to cover these as well. http://gerrit.cloudera.org:8080/#/c/9777/1/testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite.test File testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite.test: http://gerrit.cloudera.org:8080/#/c/9777/1/testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite.test@2352 PS1, Line 2352: ==== Also add a case where a percentile shows up in the subquery (scalar and non-scalar subqueries) http://gerrit.cloudera.org:8080/#/c/9777/1/testdata/workloads/functional-query/queries/QueryTest/aggregation.test File testdata/workloads/functional-query/queries/QueryTest/aggregation.test: http://gerrit.cloudera.org:8080/#/c/9777/1/testdata/workloads/functional-query/queries/QueryTest/aggregation.test@1377 PS1, Line 1377: select percentile_disc(-1) within group (order by tinyint_col) from alltypesagg; : ---- RESULTS : NULL : ---- TYPES : TINYINT : ==== : ---- QUERY : select percentile_disc(1.1) within group (order by tinyint_col) from alltypesagg; : ---- RESULTS : NULL : ---- TYPES Just write one query that covers these cases "percentile_disc(1.1)..., percentile_disc(0.5)..., percentile_disc(-1)...." http://gerrit.cloudera.org:8080/#/c/9777/1/testdata/workloads/functional-query/queries/QueryTest/aggregation.test@1414 PS1, Line 1414: select percentile_disc(0) within group (order by tinyint_col) from alltypesagg; : ---- RESULTS : 1 : ---- TYPES : TINYINT : ==== : ---- QUERY : select percentile_disc(1) within group (order by tinyint_col) from alltypesagg; : ---- RESULTS : 9 : ---- TYPES : TINYINT : ==== : ---- QUERY : # Test decreasing percentile : select percentile_disc(0) within group (order by tinyint_col desc) from alltypesagg; : ---- RESULTS : 9 : ---- TYPES : TINYINT : ==== : ---- QUERY : select percentile_disc(1) within group (order by tinyint_col desc) from alltypesagg; : ---- RESULTS : 1 : ---- TYPES : TINYINT Same comment as before. I'd just replace these with one query and multiple select list items. -- To view, visit http://gerrit.cloudera.org:8080/9777 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iacef7b3fcd74c4c73d88400ce27307c3baa0121e Gerrit-Change-Number: 9777 Gerrit-PatchSet: 1 Gerrit-Owner: Tianyi Wang <[email protected]> Gerrit-Reviewer: Dimitris Tsirogiannis <[email protected]> Gerrit-Comment-Date: Tue, 27 Mar 2018 22:11:37 +0000 Gerrit-HasComments: Yes
