Daniel Becker has posted comments on this change. ( http://gerrit.cloudera.org:8080/19569 )
Change subject: IMPALA-11957: Implement Regression functions: regr_slope(), regr_intercept() and regr_r2() ...................................................................... Patch Set 1: (34 comments) Thanks Pranav! I wonder if we could create some script that tested these functions against some other database which we would like to be compatible with, for example Hive. We should run the same queries in both engines and expect the same results. I'm not sure we can insert it into our test framework but it would be good if we could at least manually run a test like this before adding these functions. The tests in aggregation.test would be a good starting point for the queries to test against Hive (or Postrgesql etc.). http://gerrit.cloudera.org:8080/#/c/19569/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/19569/1//COMMIT_MSG@9 PS1, Line 9: The linear regression functions fit an ordinary-least-squares regression line Lines in the commit message should be at most 72 chars long. http://gerrit.cloudera.org:8080/#/c/19569/1//COMMIT_MSG@10 PS1, Line 10: pairs which It would be clearer this way: "... pairs. They can be used ...". http://gerrit.cloudera.org:8080/#/c/19569/1//COMMIT_MSG@16 PS1, Line 16: of determination (also called R-squared or goodness of fit) for the regression. Please add a "Testing". http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc File be/src/exprs/aggregate-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@291 PS1, Line 291: // Implementation of regr_slope() and regr_intercept(): Did you use some source for either the implementation or the specification of these regression functions (i.e. some description you based your implementation on)? If so, please reference it here. http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@294 PS1, Line 294: the regression slope and regression intercept Now this information is written twice in this comment: here and starting on L297, which is a bit more detailed. I propose to move the more detailed version here and remove it from around L297. Something like this: "... two arguments of numeric types and return the regression slope of the line and the y-intercept of the regression line respectively." http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@294 PS1, Line 294: returns Nit: return http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@296 PS1, Line 296: pairs which Like in the commit message, "... pairs. They can be used ..." would be clearer. http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@297 PS1, Line 297: // regr_slope() returns the slope of the line while regr_intercept() returns the See comment on L294. http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@303 PS1, Line 303: // // where x and y are the dependent and independent variables respectively. The convention is that 'y' is the dependent variable and 'x' is the independent variable. The first argument of these functions is the dependent variable (which should be 'y'). (See for example https://docs.snowflake.com/en/sql-reference/functions/regr_intercept). We should stick to this convention. Also, the order of the arguments should be clearly described here, something like this: "regr_slope(y, x) = covar_pop(x,y) / var_pop(x) where y is the dependent variable and x is the independent variable." http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@308 PS1, Line 308: n 'n' is not introduced here, it should be "'count' times the variance of ...". Applies also to the following line (L309). http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@317 PS1, Line 317: memset(dst->ptr, 0, dst->len); We could have a reset() member function in RegrSlopeState that sets each variable to 0. http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@341 PS1, Line 341: double deltaX = x - state->xavg; Could move the calculation of the deltas to the else branch, they are only used there. http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@344 PS1, Line 344: memset(state, 0, sizeof(RegrSlopeState)); Could use state->reset(), see L317. http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@418 PS1, Line 418: memcpy(dst_state, src_state, sizeof(RegrSlopeState)); Could use the copy assignment operator (*dst_state = *src_state) instead of mempcy. http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@421 PS1, Line 421: if (nA != 0 && nB != 0) { Could use else if, and then no need for the condition "nA != 0". http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@452 PS1, Line 452: if (r == 0.0) return DoubleVal::null(); This case has already been handled on L448. http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@453 PS1, Line 453: corr Why use 'corr' as the variable name? Wouldn't 'slope' be better? http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@459 PS1, Line 459: state->count == 0 || state->count == 1 These cases are handled in RegrSlopeGetValue(). http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@479 PS1, Line 479: if (r == 0.0) return DoubleVal::null(); Already handled on L474. http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@480 PS1, Line 480: RegrSlope Nit: variable name should start with a lower case letter: regrSlope. http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@488 PS1, Line 488: state->count == 0 || state->count == 1 These cases are handled in RegrInterceptGetValue(). http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@690 PS1, Line 690: var_pop(x) > 0 and var_pop(y) != 0 This condition is superfluous, it follows from the ones above, but in case we keep it: for var_pop(x) we use > 0 and for var_pop(y) we use != 0. Since variances cannot be negative, there is no difference between the two and we should be consistent and use the same for both cases. If we use > 0, we should leave a reminder mentioning that variances are non-negative. http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@691 PS1, Line 691: // where x and y are the dependent and independent variables respectively. See comment on L303 about the independent and dependent variables, it applies here as well. http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@699 PS1, Line 699: state->xvar <= 0.0 || : state->yvar <= 0.0 This is incorrect: this returns NULL if either xvar or yvar is (around) 0, but we should only return NULL if the variance independent variable (which should be x) is 0 and we should return 1 if the dependent variable (which should be y) is 0, like on L707. Note that we still have to account for the dependent variable becoming negative because of floating point rounding, in this case we should treat it as zero and return 1. We should add test cases covering all of the possible cases in the formula. http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@703 PS1, Line 703: r In the comment before the function you use 'r' to signify the result of the regr_r2() function. It is misleading if you use 'r' here for a different value. http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@705 PS1, Line 705: double corr = state->covar / r; The calculation of 'corr' and 'r' should come after dealing with the cases of xvar or yvar being 0, otherwise we don't need it. http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@706 PS1, Line 706: if (state->yvar == 0.0) return DoubleVal::null(); This has already been handled on L699. http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@707 PS1, Line 707: && state->yvar != 0 This has already been checked, no need to include this condition. http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@708 PS1, Line 708: state->xvar > 0 && state->yvar != 0 These cases have already been checked. http://gerrit.cloudera.org:8080/#/c/19569/1/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java File fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java: http://gerrit.cloudera.org:8080/#/c/19569/1/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@1369 PS1, Line 1369: Should add a comment: // Regr_r2() (see for example L1345. http://gerrit.cloudera.org:8080/#/c/19569/1/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@1394 PS1, Line 1394: db.addBuiltin(AggregateFunction.createBuiltin(db, "regr_slope", Should add comment: Regr_slope(). http://gerrit.cloudera.org:8080/#/c/19569/1/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@1415 PS1, Line 1415: Should add comment: Regr_intercepr(). http://gerrit.cloudera.org:8080/#/c/19569/1/testdata/workloads/functional-query/queries/QueryTest/aggregation.test File testdata/workloads/functional-query/queries/QueryTest/aggregation.test: http://gerrit.cloudera.org:8080/#/c/19569/1/testdata/workloads/functional-query/queries/QueryTest/aggregation.test@2157 PS1, Line 2157: NULL,NULL,NULL Why are these NULLs? Is it correct? http://gerrit.cloudera.org:8080/#/c/19569/1/testdata/workloads/functional-query/queries/QueryTest/aggregation.test@2443 PS1, Line 2443: state->xvar <= 0.0 || state->yvar <= 0.0 See L699 in aggregate-functions-ir.cc about why we can't handle these two cases together. We should add tests that demonstrate that we handle these cases correctly. -- To view, visit http://gerrit.cloudera.org:8080/19569 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iab6bd84ae3e0c02ec924c30183308123b951caa3 Gerrit-Change-Number: 19569 Gerrit-PatchSet: 1 Gerrit-Owner: Anonymous Coward <[email protected]> Gerrit-Reviewer: Anonymous Coward <[email protected]> Gerrit-Reviewer: Daniel Becker <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Comment-Date: Mon, 06 Mar 2023 13:16:02 +0000 Gerrit-HasComments: Yes
