Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17387 )
Change subject: IMPALA-10681: Improve join cardinality estimates ...................................................................... Patch Set 3: (9 comments) Hi Aman, Thanks a lot for the work. I think it should help with the cases nicely. http://gerrit.cloudera.org:8080/#/c/17387/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/17387/3//COMMIT_MSG@20 PS3, Line 20: and nit. extra http://gerrit.cloudera.org:8080/#/c/17387/3//COMMIT_MSG@21 PS3, Line 21: If it has a group-by and the group-by : columns alread have associated NDV, we can can still know the : combined NDV. Just wonder if this case is handled in the code. http://gerrit.cloudera.org:8080/#/c/17387/3/fe/src/main/java/org/apache/impala/planner/JoinNode.java File fe/src/main/java/org/apache/impala/planner/JoinNode.java: http://gerrit.cloudera.org:8080/#/c/17387/3/fe/src/main/java/org/apache/impala/planner/JoinNode.java@430 PS3, Line 430: getOtherJoinCardinality getNdvAdjustedJoinCardinality() may sound more close in semanticds of this function. http://gerrit.cloudera.org:8080/#/c/17387/3/fe/src/main/java/org/apache/impala/planner/JoinNode.java@431 PS3, Line 431: long lhsCard, long rhsCard May add a comment for these two input variables. http://gerrit.cloudera.org:8080/#/c/17387/3/fe/src/main/java/org/apache/impala/planner/JoinNode.java@441 PS3, Line 441: if (stats.lhsNumRows() > lhsCard) lhsAdjNdv *= lhsCard / stats.lhsNumRows(); I wonder if we should trust lhsCard more here. On paper, NDV of an aggregate function is always 1, while the row count at the table level is somewhat accurate, and the row count above the scan is much less. http://gerrit.cloudera.org:8080/#/c/17387/3/fe/src/main/java/org/apache/impala/planner/JoinNode.java@448 PS3, Line 448: long joinCard = Math.round((lhsCard / Math.max(1, Math.max(lhsAdjNdv, rhsAdjNdv))) * : rhsCard); For equi- inner joins and the max. cardinality is to be returned, it seems an alternative formula could be (CardL / NdvL) * (CardR / NdvR) which is more intuitive to read and takes care of the occurrence frequency for both sides. http://gerrit.cloudera.org:8080/#/c/17387/3/fe/src/main/java/org/apache/impala/planner/JoinNode.java@550 PS3, Line 550: if (!hasNdvStats(slotDesc)) return false; : if (!hasNumRowsStats(slotDesc)) return false; : return true; nit. May shorten to return (hasNdvStats() && hasNumRowsStats()). http://gerrit.cloudera.org:8080/#/c/17387/3/fe/src/main/java/org/apache/impala/planner/JoinNode.java@606 PS3, Line 606: return Math.min(lhsNdv_, lhsNumRows_); nit. May adjust the lhsNdv_ in the cstr to avoid repeated computation. http://gerrit.cloudera.org:8080/#/c/17387/3/fe/src/main/java/org/apache/impala/planner/JoinNode.java@607 PS3, Line 607: return Math.min(rhsNdv_, rhsNumRows_); Same as above. -- To view, visit http://gerrit.cloudera.org:8080/17387 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I8aa9d3b8f3c4848b3e9414fe19ad7ad348d12ecc Gerrit-Change-Number: 17387 Gerrit-PatchSet: 3 Gerrit-Owner: Aman Sinha <[email protected]> Gerrit-Reviewer: Aman Sinha <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Wed, 19 May 2021 15:09:17 +0000 Gerrit-HasComments: Yes
