Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/18413 )
Change subject: IMPALA-11205: Implement CORR() function ...................................................................... Patch Set 4: (9 comments) http://gerrit.cloudera.org:8080/#/c/18413/4//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/18413/4//COMMIT_MSG@10 PS4, Line 10: nit: remove the leading space http://gerrit.cloudera.org:8080/#/c/18413/4/be/src/exprs/aggregate-functions-ir.cc File be/src/exprs/aggregate-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/18413/4/be/src/exprs/aggregate-functions-ir.cc@292 PS4, Line 292: // type and returns the correlation coefficient between them. Could you comment the formula here? http://gerrit.cloudera.org:8080/#/c/18413/4/be/src/exprs/aggregate-functions-ir.cc@359 PS4, Line 359: nit: remove this blank line http://gerrit.cloudera.org:8080/#/c/18413/4/be/src/exprs/aggregate-functions-ir.cc@366 PS4, Line 366: ctx->Free(src.ptr); Could you explain why we need to free it here? It's not allocated by this function. http://gerrit.cloudera.org:8080/#/c/18413/4/be/src/exprs/aggregate-functions-ir.cc@373 PS4, Line 373: ctx->Free(src.ptr); Why do we free it here? We still use the CorrState below. http://gerrit.cloudera.org:8080/#/c/18413/4/be/src/exprs/aggregate-functions-ir.cc@375 PS4, Line 375: // Calculating correlation coefficient using Pearson Product Moment Correlation formula : double corr = ((state->prod*state->count) - (state->sumx*state->sumy)) / : (pow((((state->count*state->sum_squaredx) - : (state->sumx*state->sumx)) * ((state->count*state->sum_squaredy) - : (state->sumy*state->sumy))), 0.5)); I just realized that we are using a different formula than Hive's: https://github.com/apache/hive/blob/7b3ecf617a6d46f48a3b6f77e0339fd4ad95a420/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFCorrelation.java#L49-L60 Are they equivalent? http://gerrit.cloudera.org:8080/#/c/18413/4/testdata/workloads/functional-query/queries/QueryTest/uda.test File testdata/workloads/functional-query/queries/QueryTest/uda.test: http://gerrit.cloudera.org:8080/#/c/18413/4/testdata/workloads/functional-query/queries/QueryTest/uda.test@12 PS4, Line 12: '0.00032' I checked the test codes and found that it already handles rounding inaccurary: https://github.com/apache/impala/blob/31bd2a47036e7be3a0a32f873b60d2f70dcd8c9f/tests/common/test_result_verifier.py#L176-L185 So we are safe to use double results directly. BTW, can we add more corr() on other columns? http://gerrit.cloudera.org:8080/#/c/18413/4/testdata/workloads/functional-query/queries/QueryTest/uda.test@15 PS4, Line 15: ==== Can we add tests on NULL values? Some tpcds tables contain NULLs, e.g. catalog_sales has NULLs in cs_sold_date_sk column. http://gerrit.cloudera.org:8080/#/c/18413/4/testdata/workloads/functional-query/queries/QueryTest/uda.test@32 PS4, Line 32: select corr(double_col,double_col) from functional.alltypes; Could you add tests for other types as well? E.g. select corr(tinyint_col, tinyint_col), corr(smallint_col, smallint_col), corr(int_col, int_col), corr(bigint_col, bigint_col), corr(float_col, float_col), corr(double_col, double_col), corr(timestamp_col, timestamp_col) from functional.alltypes; 'alltypes' table doesn't have decimal types. We can use the `decimal_tbl` table. -- To view, visit http://gerrit.cloudera.org:8080/18413 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I32ad627c953ba24d9cde2d5549bdd0d27a9c0d06 Gerrit-Change-Number: 18413 Gerrit-PatchSet: 4 Gerrit-Owner: Anonymous Coward <[email protected]> Gerrit-Reviewer: Anonymous Coward <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Comment-Date: Fri, 15 Apr 2022 01:04:58 +0000 Gerrit-HasComments: Yes
