[Impala-ASF-CR] IMPALA-11957: Implement Regression functions: regr slope(), regr intercept() and regr r2()

Daniel Becker (Code Review) Mon, 06 Mar 2023 05:21:17 -0800

Daniel Becker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19569 )


Change subject: IMPALA-11957: Implement Regression functions: regr_slope(), 
regr_intercept() and regr_r2()
......................................................................


Patch Set 1:

(34 comments)

Thanks Pranav!

I wonder if we could create some script that tested these functions against 
some other database which we would like to be compatible with, for example 
Hive. We should run the same queries in both engines and expect the same 
results. I'm not sure we can insert it into our test framework but it would be 
good if we could at least manually run a test like this before adding these 
functions.

The tests in aggregation.test would be a good starting point for the queries to 
test against Hive (or Postrgesql etc.).

http://gerrit.cloudera.org:8080/#/c/19569/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19569/1//COMMIT_MSG@9
PS1, Line 9: The linear regression functions fit an ordinary-least-squares 
regression line
Lines in the commit message should be at most 72 chars long.


http://gerrit.cloudera.org:8080/#/c/19569/1//COMMIT_MSG@10
PS1, Line 10: pairs which
It would be clearer this way:
"... pairs. They can be used ...".


http://gerrit.cloudera.org:8080/#/c/19569/1//COMMIT_MSG@16
PS1, Line 16: of determination (also called R-squared or goodness of fit) for 
the regression.
Please add a "Testing".


http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@291
PS1, Line 291: // Implementation of regr_slope() and regr_intercept():
Did you use some source for either the implementation or the specification of 
these regression functions (i.e. some description you based your implementation 
on)? If so, please reference it here.


http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@294
PS1, Line 294: the regression slope and regression intercept
Now this information is written twice in this comment: here and starting on 
L297, which is a bit more detailed. I propose to move the more detailed version 
here and remove it from around L297.

Something like this:
"... two arguments of numeric types and return the regression slope of the line 
and the y-intercept of the regression line respectively."


http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@294
PS1, Line 294: returns
Nit: return


http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@296
PS1, Line 296: pairs which
Like in the commit message, "... pairs. They can be used ..." would be clearer.


http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@297
PS1, Line 297: // regr_slope() returns the slope of the line while 
regr_intercept() returns the
See comment on L294.


http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@303
PS1, Line 303: // // where x and y are the dependent and independent variables 
respectively.
The convention is that 'y' is the dependent variable and 'x' is the independent 
variable. The first argument of these functions is the dependent variable 
(which should be 'y'). (See for example 
https://docs.snowflake.com/en/sql-reference/functions/regr_intercept).

We should stick to this convention.

Also, the order of the arguments should be clearly described here, something 
like this:

"regr_slope(y, x) = covar_pop(x,y) / var_pop(x)
where y is the dependent variable and x is the independent variable."


http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@308
PS1, Line 308: n
'n' is not introduced here, it should be "'count' times the variance of ...".
Applies also to the following line (L309).


http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@317
PS1, Line 317:   memset(dst->ptr, 0, dst->len);
We could have a reset() member function in RegrSlopeState that sets each 
variable to 0.


http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@341
PS1, Line 341:   double deltaX = x - state->xavg;
Could move the calculation of the deltas to the else branch, they are only used 
there.


http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@344
PS1, Line 344:     memset(state, 0, sizeof(RegrSlopeState));
Could use state->reset(), see L317.


http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@418
PS1, Line 418:       memcpy(dst_state, src_state, sizeof(RegrSlopeState));
Could use the copy assignment operator (*dst_state = *src_state) instead of 
mempcy.


http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@421
PS1, Line 421:     if (nA != 0 && nB != 0) {
Could use else if, and then no need for the condition "nA != 0".


http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@452
PS1, Line 452:   if (r == 0.0) return DoubleVal::null();
This case has already been handled on L448.


http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@453
PS1, Line 453: corr
Why use 'corr' as the variable name? Wouldn't 'slope' be better?


http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@459
PS1, Line 459: state->count == 0 || state->count == 1
These cases are handled in RegrSlopeGetValue().


http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@479
PS1, Line 479:   if (r == 0.0) return DoubleVal::null();
Already handled on L474.


http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@480
PS1, Line 480: RegrSlope
Nit: variable name should start with a lower case letter: regrSlope.


http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@488
PS1, Line 488: state->count == 0 || state->count == 1
These cases are handled in RegrInterceptGetValue().


http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@690
PS1, Line 690: var_pop(x) > 0 and var_pop(y)  != 0
This condition is superfluous, it follows from the ones above, but in case we 
keep it: for var_pop(x) we use > 0 and for var_pop(y) we use != 0. Since 
variances cannot be negative, there is no difference between the two and we 
should be consistent and use the same for both cases.
If we use > 0, we should leave a reminder mentioning that variances are 
non-negative.


http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@691
PS1, Line 691: // where x and y are the dependent and independent variables 
respectively.
See comment on L303 about the independent and dependent variables, it applies 
here as well.


http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@699
PS1, Line 699: state->xvar <= 0.0 ||
             :       state->yvar <= 0.0
This is incorrect: this returns NULL if either xvar or yvar is (around) 0, but 
we should only return NULL if the variance independent variable (which should 
be x) is 0 and we should return 1 if the dependent variable (which should be y) 
is 0, like on L707.
Note that we still have to account for the dependent variable becoming negative 
because of floating point rounding, in this case we should treat it as zero and 
return 1.

We should add test cases covering all of the possible cases in the formula.


http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@703
PS1, Line 703: r
In the comment before the function you use 'r' to signify the result of the 
regr_r2() function. It is misleading if you use 'r' here for a different value.


http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@705
PS1, Line 705:   double corr = state->covar / r;
The calculation of 'corr' and 'r' should come after dealing with the cases of 
xvar or yvar being 0, otherwise we don't need it.


http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@706
PS1, Line 706:   if (state->yvar == 0.0) return DoubleVal::null();
This has already been handled on L699.


http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@707
PS1, Line 707: && state->yvar != 0
This has already been checked, no need to include this condition.


http://gerrit.cloudera.org:8080/#/c/19569/1/be/src/exprs/aggregate-functions-ir.cc@708
PS1, Line 708: state->xvar > 0 && state->yvar != 0
These cases have already been checked.


http://gerrit.cloudera.org:8080/#/c/19569/1/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
File fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java:

http://gerrit.cloudera.org:8080/#/c/19569/1/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@1369
PS1, Line 1369:
Should add a comment: // Regr_r2() (see for example L1345.


http://gerrit.cloudera.org:8080/#/c/19569/1/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@1394
PS1, Line 1394:     db.addBuiltin(AggregateFunction.createBuiltin(db, 
"regr_slope",
Should add comment: Regr_slope().


http://gerrit.cloudera.org:8080/#/c/19569/1/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@1415
PS1, Line 1415:
Should add comment: Regr_intercepr().


http://gerrit.cloudera.org:8080/#/c/19569/1/testdata/workloads/functional-query/queries/QueryTest/aggregation.test
File testdata/workloads/functional-query/queries/QueryTest/aggregation.test:

http://gerrit.cloudera.org:8080/#/c/19569/1/testdata/workloads/functional-query/queries/QueryTest/aggregation.test@2157
PS1, Line 2157: NULL,NULL,NULL
Why are these NULLs? Is it correct?


http://gerrit.cloudera.org:8080/#/c/19569/1/testdata/workloads/functional-query/queries/QueryTest/aggregation.test@2443
PS1, Line 2443: state->xvar <= 0.0 || state->yvar <= 0.0
See L699 in aggregate-functions-ir.cc about why we can't handle these two cases 
together.

We should add tests that demonstrate that we handle these cases correctly.



--
To view, visit http://gerrit.cloudera.org:8080/19569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iab6bd84ae3e0c02ec924c30183308123b951caa3
Gerrit-Change-Number: 19569
Gerrit-PatchSet: 1
Gerrit-Owner: Anonymous Coward <[email protected]>
Gerrit-Reviewer: Anonymous Coward <[email protected]>
Gerrit-Reviewer: Daniel Becker <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Comment-Date: Mon, 06 Mar 2023 13:16:02 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11957: Implement Regression functions: regr slope(), regr intercept() and regr r2()

Reply via email to