Daniel Becker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20345 )

Change subject: IMPALA-12361: Implementation of regr_count(), regr_avgx() and 
regr_avgy()
......................................................................


Patch Set 2:

(51 comments)

Thanks Pranav!

http://gerrit.cloudera.org:8080/#/c/20345/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20345/2//COMMIT_MSG@9
PS2, Line 9: The linear regression functions fit an ordinary-least-squares 
regression
           : line to a set of number pairs.
These functions (regr_count(), regr_avgx() and regr_avgy()) don't fit a 
regression line.


http://gerrit.cloudera.org:8080/#/c/20345/2//COMMIT_MSG@13
PS2, Line 13: regr_count()
regr_count(y, x) would be better.


http://gerrit.cloudera.org:8080/#/c/20345/2//COMMIT_MSG@14
PS2, Line 14: used to fit the regression line.
It doesn't fit a regression line. We could simply say that it returns the 
number of non-null pairs.


http://gerrit.cloudera.org:8080/#/c/20345/2//COMMIT_MSG@15
PS2, Line 15: e
Add space before "(x)".


http://gerrit.cloudera.org:8080/#/c/20345/2//COMMIT_MSG@15
PS2, Line 15: regr_avgx()
Should be regr_avgx(y, x).


http://gerrit.cloudera.org:8080/#/c/20345/2//COMMIT_MSG@16
PS2, Line 16: of the regression line
There is not necessarily any regression line fitting involved, I'd leave this 
out.


http://gerrit.cloudera.org:8080/#/c/20345/2//COMMIT_MSG@16
PS2, Line 16: It computes avg(expr2) after the
            : elimination of null (expr1, expr2).
expr1 and expr2 are not defined here, we should not use these. I think it would 
be better if the first sentence contained the part about eliminating NULLs, 
something like this:

regr_avgx(y, x) computes the average of the independent variable (x) after 
eliminating pairs containing NULL.


http://gerrit.cloudera.org:8080/#/c/20345/2//COMMIT_MSG@18
PS2, Line 18: regr_avgy() evaluates the average of the dependent variable(y)
See comments for regr_avgx().


http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@284
PS2, Line 284: r
Add space after //.


http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@284
PS2, Line 284: ()
Should be regr_count(y, x).


http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@284
PS2, Line 284: an integer that is
I don't think this is useful, just makes the sentence more complicated.


http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@284
PS2, Line 284: used to fit
             : // the regression line.
See the comment in the commit message, I don't think it's necessary to refer to 
a regression line here.


http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@289
PS2, Line 289:   if (src1.is_null || src2.is_null) return;
I think

if (!src1.is_null && !src2.is_null) ++dst->val;

would be cleaner.


http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@296
PS2, Line 296: if (src1.is_null || src2.is_null) return;
             :   --dst->val;
I think

if (!src1.is_null && !src2.is_null) --dst->val;

would be cleaner.


http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@304
PS2, Line 304: !(src1.is_null && src2.is_null)
This will return true for pairs where only one element is non-null. It should 
be !src1.is_null && !src2.is_null.
Please find a test where it fails and add that test to the others.


http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@313
PS2, Line 313:   if (!(src1.is_null && src2.is_null)) {
See L304.


http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@326
PS2, Line 326: //regr_avgx() evaluates the average of the independent 
variable(x) of the regression
See comment about the wording in the commit message.


http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@326
PS2, Line 326: /r
Add space.


http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@328
PS2, Line 328: /r
Add space.


http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@328
PS2, Line 328: //regr_avgy() evaluates the average of the dependent variable(y) 
of the regression
See comment about the wording in the commit message.


http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@330
PS2, Line 330: RegrAvgState
Can't we use AvgState instead? That struct is essentially the same, except that 
it counts the sum instead of a running average.

Is there a reason why you count a running average here instead of counting the 
sum and dividing it in the Finalize*() function like 'avg()'? I think that's 
more efficient and less prone to rounding errors.


http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@330
PS2, Line 330: struct RegrAvgState {
We should mention that this struct is used to calculate the above functions, 
it's not really good if the descriptions just appear there.


http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@336
PS2, Line 336:   dst->is_null = false;
If 'AllocBuffer()' below stays, it will set dst->is_null so we don't need to do 
it here.


http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@338
PS2, Line 338:   AllocBuffer(ctx, dst, dst->len);
AvgInit() doesn't allocate memory, it uses a fixed length buffer, see AvgInit() 
and the "// Avg" section in BuiltinsDb.java. We could do the same.


http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@346
PS2, Line 346: double y
It is enough to take one 'double' parameter, we never use 'y'.


http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@354
PS2, Line 354: double y
It is enough to take one 'double' parameter, we never use 'y'.


http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@355
PS2, Line 355:   if (state->count <= 1) {
In several places you have the comment that 'Remove... doesn't need to 
explicitly check the number of calls to Update() or Remove() ..." (for example 
L376) but you do check it here. A DCHECK that the count is not less than 0 
would be enough.


http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@365
PS2, Line 365: RegrAvgUpdate
If we have 'RegrAvgYUpdate()', this should be 'RegrAvgXUpdate().


http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@374
PS2, Line 374: void AggregateFunctions::RegrAvgRemove(FunctionContext* ctx,
If we have 'RegrAvgYRemove()', this should be 'RegrAvgXRemove().


http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@386
PS2, Line 386: TimestampRegrAvgUpdate
If we have 'TimestampRegrAvgYUpdate()', this should be 
'TimestampRegrAvgXUpdate()'.


http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@388
PS2, Line 388:   if (src1.is_null || src2.is_null) return;
You could have the same DCHECKs here that you have in RegrAvgRemove.


http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@399
PS2, Line 399: TimestampRegrAvgRemove
If we have 'TimestampRegrAvgYRemove()', this should be 
'TimestampRegrAvgXRemove()'.


http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@404
PS2, Line 404:   if (src1.is_null || src2.is_null) return;
You could have the same DCHECKs here that you have in RegrAvgRemove.


http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@417
PS2, Line 417:   RegrAvgState* src_state = 
reinterpret_cast<RegrAvgState*>(src.ptr);
You could also DCHECK the size of src before casting it.


http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@423
PS2, Line 423:     if (nA == 0) {
If we go over to counting the sum instead of a running avg in the state, this 
special case will probably not be needed because we won't need to divide by 
'dst_state->count'.


http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@1006
PS2, Line 1006:
Unneeded extra spaces.


http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@1013
PS2, Line 1013:
Unneeded extra spaces.


http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@1019
PS2, Line 1019:
Unneeded extra spaces.


http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@1047
PS2, Line 1047:   
Unneeded extra spaces.


http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@1143
PS2, Line 1143:
Unneeded extra spaces.


http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@1149
PS2, Line 1149:
Unneeded extra spaces.


http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@1154
PS2, Line 1154:
Unneeded extra spaces.


http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@1554
PS2, Line 1554:
Unneeded extra spaces.
Usually we don't align operands, we have an indentation of a fixed number of 4 
spaces. Here the operands were aligned, which also makes sense, it's just not 
our convention. But aligning and then adding 4 spaces is not useful.

So we should either leave this line as it was before this change OR remove 
spaces so that the indentation is 4 spaces.


http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@2935
PS2, Line 2935:
Unneeded extra spaces: we don't usually align operands.


http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@2999
PS2, Line 2999:                     
Unneeded extra spaces: we don't usually align operands.


http://gerrit.cloudera.org:8080/#/c/20345/2/bin/cmake_aux/create_py3_virtualenv.sh
File bin/cmake_aux/create_py3_virtualenv.sh:

http://gerrit.cloudera.org:8080/#/c/20345/2/bin/cmake_aux/create_py3_virtualenv.sh@45
PS2, Line 45:   # Failure
Do these changes come from a rebase?


http://gerrit.cloudera.org:8080/#/c/20345/2/testdata/workloads/functional-query/queries/QueryTest/aggregation.test
File testdata/workloads/functional-query/queries/QueryTest/aggregation.test:

http://gerrit.cloudera.org:8080/#/c/20345/2/testdata/workloads/functional-query/queries/QueryTest/aggregation.test@2084
PS2, Line 2084: null
Does it / should it return null? Shouldn't it be 1?


http://gerrit.cloudera.org:8080/#/c/20345/2/testdata/workloads/functional-query/queries/QueryTest/aggregation.test@2293
PS2, Line 2293: avgy
Should be regr_avgy().


http://gerrit.cloudera.org:8080/#/c/20345/2/testdata/workloads/functional-query/queries/QueryTest/aggregation.test@2317
PS2, Line 2317: null
Shouldn't they return the value in the single row?


http://gerrit.cloudera.org:8080/#/c/20345/2/testdata/workloads/functional-query/queries/QueryTest/aggregation.test@2460
PS2, Line 2460: regrAvgRemove
Shouldn't it be 'regrAvgXRemove()'?


http://gerrit.cloudera.org:8080/#/c/20345/2/testdata/workloads/functional-query/queries/QueryTest/aggregation.test@2480
PS2, Line 2480: regrAvgRemove
Isn't it 'regrAvgYRemove()'?



--
To view, visit http://gerrit.cloudera.org:8080/20345
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia17c565758565b868d3a54b3ebc12da51c37e143
Gerrit-Change-Number: 20345
Gerrit-PatchSet: 2
Gerrit-Owner: Anonymous Coward <[email protected]>
Gerrit-Reviewer: Daniel Becker <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Comment-Date: Tue, 05 Sep 2023 12:42:05 +0000
Gerrit-HasComments: Yes

Reply via email to