[email protected] has posted comments on this change. ( http://gerrit.cloudera.org:8080/19569 )
Change subject: IMPALA-11957: Implement Regression functions: regr_slope(), regr_intercept() and regr_r2() ...................................................................... Patch Set 4: (54 comments) > Patch Set 4: > > (1 comment) A few more tests are yet to be added. http://gerrit.cloudera.org:8080/#/c/19569/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/19569/2//COMMIT_MSG@18 PS2, Line 18: > You could have a newline after the ':'. Done http://gerrit.cloudera.org:8080/#/c/19569/2//COMMIT_MSG@18 PS2, Line 18: > Nit: Hive (capital H). Done http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc File be/src/exprs/aggregate-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@291 PS1, Line 291: // Implementation of regr_slope() and regr_intercept(): > Did you use the Oracle link as specification? Did you take the implementati Yes, I've used the Oracle link just for specification, couldn't find these exact formulas on any paper but their constituent parts like xvar, yvar, covar etc are mentioned in papers and have a similar implementation as correlation and covariance functions. http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@294 PS1, Line 294: he regression slope of the line and the y-int > Done Done http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@294 PS1, Line 294: return > Nit: return Done http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@296 PS1, Line 296: a set of nu > Done Done http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@297 PS1, Line 297: // analytic functions. > Done Done http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@303 PS1, Line 303: // regr_intercept() formula used: > Done Done http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@308 PS1, Line 308: a > Done Done http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@317 PS1, Line 317: dst->len = sizeof(RegrSlopeState); > Not addressed yet. As discussed, to maintain uniformity across all the other aggregate functions, it'd be better to stick with memset(). Open for discussion. http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@341 PS1, Line 341: > Done Done http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@344 PS1, Line 344: memset(state, 0, sizeof(RegrSlopeState)); > Not addressed yet. As discussed, to maintain uniformity across all the other aggregate functions, it'd be better to stick with memset(). Open for discussion. http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@418 PS1, Line 418: int64_t nB = src_state->count; > Done Done http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@421 PS1, Line 421: return; > Could use else if, and then no need for the condition "nA != 0". Done http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@442 PS1, Line 442: > line too long (93 > 90) Done http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@452 PS1, Line 452: } > Done Done http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@453 PS1, Line 453: regr > Done Done http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@457 PS1, Line 457: DoubleVal AggregateFunctions::RegrSlopeFinalize(FunctionContext* ctx, > line too long (93 > 90) Done http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@459 PS1, Line 459: > Done Done http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@468 PS1, Line 468: DoubleVal AggregateFunctions::RegrInterceptGetValue(FunctionContext* ctx, > line too long (97 > 90) Done http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@479 PS1, Line 479: double regrIntercept = state->yavg - (regrSlope * state->xavg); > Done Done http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@480 PS1, Line 480: DoubleVal > Done Done http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@486 PS1, Line 486: ctx->Free(src.ptr); > line too long (97 > 90) Done http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@488 PS1, Line 488: > Done Done http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@690 PS1, Line 690: > This condition is superfluous, it follows from the ones above, but in case Done http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@691 PS1, Line 691: DoubleVal AggregateFunctions::Regr_r2GetValue(FunctionContext* ctx, > See comment on L303 about the independent and dependent variables, it appli Done http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@692 PS1, Line 692: const StringVal& src) { > line too long (91 > 90) Done http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@699 PS1, Line 699: state->xvar <= 0.0 || : state->yvar <= 0.0 > Not addressed yet. It is important that we create tests for this case. Under process, I ran a bunch of queries but couldn't find a case where yvar becomes negative, am still looking for it and will update the new patch with it. http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@703 PS1, Line 703: n > Please find a meaningful name for 'r'. I think r_squared should make sense. Please let me know if we should change it to something else. http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@705 PS1, Line 705: double r_squared = state->xvar * state->yvar; > Done Done http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@706 PS1, Line 706: // Returns null if xvar and yvar are very small and their product > Done Done http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@707 PS1, Line 707: e of floating point > Done Done http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@708 PS1, Line 708: quared == 0.0) return DoubleVal::nu > Done Done http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@712 PS1, Line 712: DoubleVal AggregateFunctions::Regr_r2Finalize(FunctionContext* ctx, > line too long (91 > 90) Done http://gerrit.cloudera.org:8080/#/c/19569/2/be/src/exprs/aggregate-functions-ir.cc File be/src/exprs/aggregate-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/19569/2/be/src/exprs/aggregate-functions-ir.cc@299 PS2, Line 299: // https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/REGR_-Linear-Regression-Functions.html#GUID-A675B68F-2A88-4843-BE2C-FCDE9C65F9A9 > I think URLs should be an exception to the line width rule as if they are b Done http://gerrit.cloudera.org:8080/#/c/19569/2/be/src/exprs/aggregate-functions-ir.cc@301 PS2, Line 301: // regr_slope() formula used: > Nit: add a newline after the link. Done http://gerrit.cloudera.org:8080/#/c/19569/2/be/src/exprs/aggregate-functions-ir.cc@302 PS2, Line 302: x, y > Although covar_pop is commutative, it's more conventional to write covar_po Done http://gerrit.cloudera.org:8080/#/c/19569/2/be/src/exprs/aggregate-functions-ir.cc@304 PS2, Line 304: regr_intercept > Should be 'regr_intercepr(y, x)'. Done http://gerrit.cloudera.org:8080/#/c/19569/2/be/src/exprs/aggregate-functions-ir.cc@305 PS2, Line 305: wh > Nit: no need for these slashes. Done http://gerrit.cloudera.org:8080/#/c/19569/2/be/src/exprs/aggregate-functions-ir.cc@423 PS2, Line 423: double yavgA = dst_state->yavg; > Start the 'else if' on the previous line after the '}'. Done http://gerrit.cloudera.org:8080/#/c/19569/2/be/src/exprs/aggregate-functions-ir.cc@423 PS2, Line 423: yavgA = > No need for this as if nA==0 then we go into the IF-THEN block, not the ELS Done http://gerrit.cloudera.org:8080/#/c/19569/2/be/src/exprs/aggregate-functions-ir.cc@460 PS2, Line 460: ctx->Free(src.ptr); > Nit: unnecessary line. Done http://gerrit.cloudera.org:8080/#/c/19569/2/be/src/exprs/aggregate-functions-ir.cc@488 PS2, Line 488: } > Nit: unnecessary line. Done http://gerrit.cloudera.org:8080/#/c/19569/2/be/src/exprs/aggregate-functions-ir.cc@689 PS2, Line 689: Note th > Should be 'regr_r2(y, x)', also for the next two lines. Done http://gerrit.cloudera.org:8080/#/c/19569/2/be/src/exprs/aggregate-functions-ir.cc@691 PS2, Line 691: : const StringVal& > If we keep this: Done http://gerrit.cloudera.org:8080/#/c/19569/2/be/src/exprs/aggregate-functions-ir.cc@706 PS2, Line 706: l > This should be 'yvar', see the formula. Done http://gerrit.cloudera.org:8080/#/c/19569/2/be/src/exprs/aggregate-functions-ir.cc@706 PS2, Line 706: // > Start it on the previous line, after the '}'. Done http://gerrit.cloudera.org:8080/#/c/19569/2/be/src/exprs/aggregate-functions-ir.cc@706 PS2, Line 706: nd yvar > If we have an ELSE or ELSE IF branch, we always use blocks, i.e. Noted, done! http://gerrit.cloudera.org:8080/#/c/19569/2/be/src/exprs/aggregate-functions-ir.cc@708 PS2, Line 708: if (r_squared == 0.0) return DoubleVal::null(); > Am I right that the formula should be Makes sense, updated! http://gerrit.cloudera.org:8080/#/c/19569/1/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java File fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java: http://gerrit.cloudera.org:8080/#/c/19569/1/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@1369 PS1, Line 1369: > Not addressed yet. Done http://gerrit.cloudera.org:8080/#/c/19569/1/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@1394 PS1, Line 1394: > Not addressed yet. Done http://gerrit.cloudera.org:8080/#/c/19569/1/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@1415 PS1, Line 1415: prefix + "17RegrSlopeFinalizeEPN10impala_udf15FunctionContextERKNS1_9StringValE", > Not addressed yet. Done http://gerrit.cloudera.org:8080/#/c/19569/1/testdata/workloads/functional-query/queries/QueryTest/aggregation.test File testdata/workloads/functional-query/queries/QueryTest/aggregation.test: http://gerrit.cloudera.org:8080/#/c/19569/1/testdata/workloads/functional-query/queries/QueryTest/aggregation.test@2157 PS1, Line 2157: NULL,NULL,NULL > Why are these NULLs? Is it correct? Yes, the results match with hive. http://gerrit.cloudera.org:8080/#/c/19569/1/testdata/workloads/functional-query/queries/QueryTest/aggregation.test@2443 PS1, Line 2443: state->xvar <= 0.0 || state->yvar <= 0.0 > See L699 in aggregate-functions-ir.cc about why we can't handle these two c Under process and discussion. -- To view, visit http://gerrit.cloudera.org:8080/19569 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iab6bd84ae3e0c02ec924c30183308123b951caa3 Gerrit-Change-Number: 19569 Gerrit-PatchSet: 4 Gerrit-Owner: Anonymous Coward <[email protected]> Gerrit-Reviewer: Anonymous Coward <[email protected]> Gerrit-Reviewer: Daniel Becker <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Comment-Date: Tue, 04 Apr 2023 21:23:26 +0000 Gerrit-HasComments: Yes
