Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18413 )

Change subject: IMPALA-11205: Implement CORR() function
......................................................................


Patch Set 4:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/18413/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18413/4//COMMIT_MSG@10
PS4, Line 10:
nit: remove the leading space


http://gerrit.cloudera.org:8080/#/c/18413/4/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/18413/4/be/src/exprs/aggregate-functions-ir.cc@292
PS4, Line 292: // type and returns the correlation coefficient between them.
Could you comment the formula here?


http://gerrit.cloudera.org:8080/#/c/18413/4/be/src/exprs/aggregate-functions-ir.cc@359
PS4, Line 359:
nit: remove this blank line


http://gerrit.cloudera.org:8080/#/c/18413/4/be/src/exprs/aggregate-functions-ir.cc@366
PS4, Line 366:   ctx->Free(src.ptr);
Could you explain why we need to free it here? It's not allocated by this 
function.


http://gerrit.cloudera.org:8080/#/c/18413/4/be/src/exprs/aggregate-functions-ir.cc@373
PS4, Line 373:   ctx->Free(src.ptr);
Why do we free it here? We still use the CorrState below.


http://gerrit.cloudera.org:8080/#/c/18413/4/be/src/exprs/aggregate-functions-ir.cc@375
PS4, Line 375:   // Calculating correlation coefficient using Pearson Product 
Moment Correlation formula
             :   double corr = ((state->prod*state->count) - 
(state->sumx*state->sumy)) /
             :       (pow((((state->count*state->sum_squaredx) -
             :       (state->sumx*state->sumx)) * 
((state->count*state->sum_squaredy) -
             :                    (state->sumy*state->sumy))), 0.5));
I just realized that we are using a different formula than Hive's:
https://github.com/apache/hive/blob/7b3ecf617a6d46f48a3b6f77e0339fd4ad95a420/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFCorrelation.java#L49-L60

Are they equivalent?


http://gerrit.cloudera.org:8080/#/c/18413/4/testdata/workloads/functional-query/queries/QueryTest/uda.test
File testdata/workloads/functional-query/queries/QueryTest/uda.test:

http://gerrit.cloudera.org:8080/#/c/18413/4/testdata/workloads/functional-query/queries/QueryTest/uda.test@12
PS4, Line 12: '0.00032'
I checked the test codes and found that it already handles rounding inaccurary:
https://github.com/apache/impala/blob/31bd2a47036e7be3a0a32f873b60d2f70dcd8c9f/tests/common/test_result_verifier.py#L176-L185

So we are safe to use double results directly.

BTW, can we add more corr() on other columns?


http://gerrit.cloudera.org:8080/#/c/18413/4/testdata/workloads/functional-query/queries/QueryTest/uda.test@15
PS4, Line 15: ====
Can we add tests on NULL values? Some tpcds tables contain NULLs, e.g. 
catalog_sales has NULLs in cs_sold_date_sk column.


http://gerrit.cloudera.org:8080/#/c/18413/4/testdata/workloads/functional-query/queries/QueryTest/uda.test@32
PS4, Line 32: select corr(double_col,double_col) from functional.alltypes;
Could you add tests for other types as well? E.g.

select
  corr(tinyint_col, tinyint_col),
  corr(smallint_col, smallint_col),
  corr(int_col, int_col),
  corr(bigint_col, bigint_col),
  corr(float_col, float_col),
  corr(double_col, double_col),
  corr(timestamp_col, timestamp_col)
from functional.alltypes;

'alltypes' table doesn't have decimal types. We can use the `decimal_tbl` table.



--
To view, visit http://gerrit.cloudera.org:8080/18413
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I32ad627c953ba24d9cde2d5549bdd0d27a9c0d06
Gerrit-Change-Number: 18413
Gerrit-PatchSet: 4
Gerrit-Owner: Anonymous Coward <[email protected]>
Gerrit-Reviewer: Anonymous Coward <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Comment-Date: Fri, 15 Apr 2022 01:04:58 +0000
Gerrit-HasComments: Yes

Reply via email to