Daniel Becker has posted comments on this change. ( http://gerrit.cloudera.org:8080/20345 )
Change subject: IMPALA-12361: Implementation of regr_count(), regr_avgx() and regr_avgy() ...................................................................... Patch Set 2: (51 comments) Thanks Pranav! http://gerrit.cloudera.org:8080/#/c/20345/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/20345/2//COMMIT_MSG@9 PS2, Line 9: The linear regression functions fit an ordinary-least-squares regression : line to a set of number pairs. These functions (regr_count(), regr_avgx() and regr_avgy()) don't fit a regression line. http://gerrit.cloudera.org:8080/#/c/20345/2//COMMIT_MSG@13 PS2, Line 13: regr_count() regr_count(y, x) would be better. http://gerrit.cloudera.org:8080/#/c/20345/2//COMMIT_MSG@14 PS2, Line 14: used to fit the regression line. It doesn't fit a regression line. We could simply say that it returns the number of non-null pairs. http://gerrit.cloudera.org:8080/#/c/20345/2//COMMIT_MSG@15 PS2, Line 15: e Add space before "(x)". http://gerrit.cloudera.org:8080/#/c/20345/2//COMMIT_MSG@15 PS2, Line 15: regr_avgx() Should be regr_avgx(y, x). http://gerrit.cloudera.org:8080/#/c/20345/2//COMMIT_MSG@16 PS2, Line 16: of the regression line There is not necessarily any regression line fitting involved, I'd leave this out. http://gerrit.cloudera.org:8080/#/c/20345/2//COMMIT_MSG@16 PS2, Line 16: It computes avg(expr2) after the : elimination of null (expr1, expr2). expr1 and expr2 are not defined here, we should not use these. I think it would be better if the first sentence contained the part about eliminating NULLs, something like this: regr_avgx(y, x) computes the average of the independent variable (x) after eliminating pairs containing NULL. http://gerrit.cloudera.org:8080/#/c/20345/2//COMMIT_MSG@18 PS2, Line 18: regr_avgy() evaluates the average of the dependent variable(y) See comments for regr_avgx(). http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc File be/src/exprs/aggregate-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@284 PS2, Line 284: r Add space after //. http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@284 PS2, Line 284: () Should be regr_count(y, x). http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@284 PS2, Line 284: an integer that is I don't think this is useful, just makes the sentence more complicated. http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@284 PS2, Line 284: used to fit : // the regression line. See the comment in the commit message, I don't think it's necessary to refer to a regression line here. http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@289 PS2, Line 289: if (src1.is_null || src2.is_null) return; I think if (!src1.is_null && !src2.is_null) ++dst->val; would be cleaner. http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@296 PS2, Line 296: if (src1.is_null || src2.is_null) return; : --dst->val; I think if (!src1.is_null && !src2.is_null) --dst->val; would be cleaner. http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@304 PS2, Line 304: !(src1.is_null && src2.is_null) This will return true for pairs where only one element is non-null. It should be !src1.is_null && !src2.is_null. Please find a test where it fails and add that test to the others. http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@313 PS2, Line 313: if (!(src1.is_null && src2.is_null)) { See L304. http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@326 PS2, Line 326: //regr_avgx() evaluates the average of the independent variable(x) of the regression See comment about the wording in the commit message. http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@326 PS2, Line 326: /r Add space. http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@328 PS2, Line 328: /r Add space. http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@328 PS2, Line 328: //regr_avgy() evaluates the average of the dependent variable(y) of the regression See comment about the wording in the commit message. http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@330 PS2, Line 330: RegrAvgState Can't we use AvgState instead? That struct is essentially the same, except that it counts the sum instead of a running average. Is there a reason why you count a running average here instead of counting the sum and dividing it in the Finalize*() function like 'avg()'? I think that's more efficient and less prone to rounding errors. http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@330 PS2, Line 330: struct RegrAvgState { We should mention that this struct is used to calculate the above functions, it's not really good if the descriptions just appear there. http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@336 PS2, Line 336: dst->is_null = false; If 'AllocBuffer()' below stays, it will set dst->is_null so we don't need to do it here. http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@338 PS2, Line 338: AllocBuffer(ctx, dst, dst->len); AvgInit() doesn't allocate memory, it uses a fixed length buffer, see AvgInit() and the "// Avg" section in BuiltinsDb.java. We could do the same. http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@346 PS2, Line 346: double y It is enough to take one 'double' parameter, we never use 'y'. http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@354 PS2, Line 354: double y It is enough to take one 'double' parameter, we never use 'y'. http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@355 PS2, Line 355: if (state->count <= 1) { In several places you have the comment that 'Remove... doesn't need to explicitly check the number of calls to Update() or Remove() ..." (for example L376) but you do check it here. A DCHECK that the count is not less than 0 would be enough. http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@365 PS2, Line 365: RegrAvgUpdate If we have 'RegrAvgYUpdate()', this should be 'RegrAvgXUpdate(). http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@374 PS2, Line 374: void AggregateFunctions::RegrAvgRemove(FunctionContext* ctx, If we have 'RegrAvgYRemove()', this should be 'RegrAvgXRemove(). http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@386 PS2, Line 386: TimestampRegrAvgUpdate If we have 'TimestampRegrAvgYUpdate()', this should be 'TimestampRegrAvgXUpdate()'. http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@388 PS2, Line 388: if (src1.is_null || src2.is_null) return; You could have the same DCHECKs here that you have in RegrAvgRemove. http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@399 PS2, Line 399: TimestampRegrAvgRemove If we have 'TimestampRegrAvgYRemove()', this should be 'TimestampRegrAvgXRemove()'. http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@404 PS2, Line 404: if (src1.is_null || src2.is_null) return; You could have the same DCHECKs here that you have in RegrAvgRemove. http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@417 PS2, Line 417: RegrAvgState* src_state = reinterpret_cast<RegrAvgState*>(src.ptr); You could also DCHECK the size of src before casting it. http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@423 PS2, Line 423: if (nA == 0) { If we go over to counting the sum instead of a running avg in the state, this special case will probably not be needed because we won't need to divide by 'dst_state->count'. http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@1006 PS2, Line 1006: Unneeded extra spaces. http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@1013 PS2, Line 1013: Unneeded extra spaces. http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@1019 PS2, Line 1019: Unneeded extra spaces. http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@1047 PS2, Line 1047: Unneeded extra spaces. http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@1143 PS2, Line 1143: Unneeded extra spaces. http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@1149 PS2, Line 1149: Unneeded extra spaces. http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@1154 PS2, Line 1154: Unneeded extra spaces. http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@1554 PS2, Line 1554: Unneeded extra spaces. Usually we don't align operands, we have an indentation of a fixed number of 4 spaces. Here the operands were aligned, which also makes sense, it's just not our convention. But aligning and then adding 4 spaces is not useful. So we should either leave this line as it was before this change OR remove spaces so that the indentation is 4 spaces. http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@2935 PS2, Line 2935: Unneeded extra spaces: we don't usually align operands. http://gerrit.cloudera.org:8080/#/c/20345/2/be/src/exprs/aggregate-functions-ir.cc@2999 PS2, Line 2999: Unneeded extra spaces: we don't usually align operands. http://gerrit.cloudera.org:8080/#/c/20345/2/bin/cmake_aux/create_py3_virtualenv.sh File bin/cmake_aux/create_py3_virtualenv.sh: http://gerrit.cloudera.org:8080/#/c/20345/2/bin/cmake_aux/create_py3_virtualenv.sh@45 PS2, Line 45: # Failure Do these changes come from a rebase? http://gerrit.cloudera.org:8080/#/c/20345/2/testdata/workloads/functional-query/queries/QueryTest/aggregation.test File testdata/workloads/functional-query/queries/QueryTest/aggregation.test: http://gerrit.cloudera.org:8080/#/c/20345/2/testdata/workloads/functional-query/queries/QueryTest/aggregation.test@2084 PS2, Line 2084: null Does it / should it return null? Shouldn't it be 1? http://gerrit.cloudera.org:8080/#/c/20345/2/testdata/workloads/functional-query/queries/QueryTest/aggregation.test@2293 PS2, Line 2293: avgy Should be regr_avgy(). http://gerrit.cloudera.org:8080/#/c/20345/2/testdata/workloads/functional-query/queries/QueryTest/aggregation.test@2317 PS2, Line 2317: null Shouldn't they return the value in the single row? http://gerrit.cloudera.org:8080/#/c/20345/2/testdata/workloads/functional-query/queries/QueryTest/aggregation.test@2460 PS2, Line 2460: regrAvgRemove Shouldn't it be 'regrAvgXRemove()'? http://gerrit.cloudera.org:8080/#/c/20345/2/testdata/workloads/functional-query/queries/QueryTest/aggregation.test@2480 PS2, Line 2480: regrAvgRemove Isn't it 'regrAvgYRemove()'? -- To view, visit http://gerrit.cloudera.org:8080/20345 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia17c565758565b868d3a54b3ebc12da51c37e143 Gerrit-Change-Number: 20345 Gerrit-PatchSet: 2 Gerrit-Owner: Anonymous Coward <[email protected]> Gerrit-Reviewer: Daniel Becker <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Joe McDonnell <[email protected]> Gerrit-Comment-Date: Tue, 05 Sep 2023 12:42:05 +0000 Gerrit-HasComments: Yes
