Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17344 )
Change subject: IMPALA-10677: Set selectivity of Not-equal ...................................................................... Patch Set 4: (7 comments) Thanks you for working on this. Seems like NULL values are not handled correctly right now. I think we should either use the number of NULLs from column stats and return correct results, or we should not set selectivity when the parameter is NULL. Please note that the NOT DISTINCT FROM test also produces wrong values in case of NULLs. http://gerrit.cloudera.org:8080/#/c/17344/4//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/17344/4//COMMIT_MSG@7 PS4, Line 7: IMPALA-10677 You might want to use IMPALA-7560 as this was the original Jira and it contains more information. http://gerrit.cloudera.org:8080/#/c/17344/4//COMMIT_MSG@11 PS4, Line 11: col = 5", but not "2 * col = 10 nit: wouldn't be better to use != in the example? "col != 5", but not "2 * col != 10" http://gerrit.cloudera.org:8080/#/c/17344/3/fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java File fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java: http://gerrit.cloudera.org:8080/#/c/17344/3/fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java@257 PS3, Line 257: distinctValues > Okay. Thanks for the verification. We could make an adjustment here if ther SlotRef retrieves the NDV from the SlotDescriptor's ColumnStats in SlotRef.adjustNumDistinctValues() . ColumnStats also has NumNulls which could be used here: https://github.com/apache/impala/blob/954eb5c85d329af7690698cdc4d0f409260e6d18/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java#L221 http://gerrit.cloudera.org:8080/#/c/17344/4/fe/src/test/java/org/apache/impala/analysis/ExprCardinalityTest.java File fe/src/test/java/org/apache/impala/analysis/ExprCardinalityTest.java: http://gerrit.cloudera.org:8080/#/c/17344/4/fe/src/test/java/org/apache/impala/analysis/ExprCardinalityTest.java@a372 PS4, Line 372: That seems wrong looking at the data. I think it should be 1. http://gerrit.cloudera.org:8080/#/c/17344/4/fe/src/test/java/org/apache/impala/analysis/ExprCardinalityTest.java@344 PS4, Line 344: 7300 NULL values are not included in the NDV, also null count is stored separately in the column stats. I think we should handle NULLs specially, as mentioned in the above comment. I.e. we should have the same values here that are commented out in the original test case.In this case 1.0, as there are no NULL values for 'id' in the alltypes table. http://gerrit.cloudera.org:8080/#/c/17344/4/fe/src/test/java/org/apache/impala/analysis/ExprCardinalityTest.java@355 PS4, Line 355: 1 - 1.0/1 I think the expected value is 0 here, as null_str is all nulls. http://gerrit.cloudera.org:8080/#/c/17344/4/fe/src/test/java/org/apache/impala/analysis/ExprCardinalityTest.java@359 PS4, Line 359: some_nulls is not distinct from 'foo'", I think it'd be worth to add a test case with some_nulls is distinct from null I think the expected value would be "6.0 / 26.0", i.e. (NumRows - NumNulls) / (NumRows) -- To view, visit http://gerrit.cloudera.org:8080/17344 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icd6f5945840ea2a8194d72aa440ddfa6915cbb3a Gerrit-Change-Number: 17344 Gerrit-PatchSet: 4 Gerrit-Owner: liuyao <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Reviewer: liuyao <[email protected]> Gerrit-Comment-Date: Fri, 11 Jun 2021 15:36:03 +0000 Gerrit-HasComments: Yes
