neilconway opened a new pull request, #22674: URL: https://github.com/apache/datafusion/pull/22674
## Which issue does this PR close?

- Closes #22673

## Rationale for this change

`estimate_join_cardinality` for semi-joins checks if ANY of the columns in the two join inputs are disjoint (comparing columns positionally); if so, it claims the join will not return any rows. This is wrong, for two reasons:

1. If two columns don't participate in the join key, they have no impact on the cardinality of the join result
2. Comparing arbitrary columns positionally is not a sensible thing to do in the first place

A similar issue exists for anti-joins, except we assume the anti-join will return the entire join input in this case.

We should instead just check for disjoint ranges over the pairs of columns that make up the join key.

## What changes are included in this PR?

* Fix `estimate_join_cardinality` behavior in the face of disjoint column ranges that aren't join key columns
* Refactor `estimate_join_cardinality`, rename a variable for clarity
* Add unit test

## Are these changes tested?

Yes, new test added.

## Are there any user-facing changes?

Better plans / avoid buggy cardinality estimate.
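To illustrate the rationale above, here is a minimal, self-contained sketch of checking for disjoint ranges only over join-key column pairs. It is not the DataFusion implementation: `ColumnRange`, `JoinKind`, `ranges_disjoint`, and `estimate_semi_or_anti` are hypothetical names invented for this example, and statistics are simplified to optional `i64` min/max bounds.

```rust
/// Hypothetical min/max summary for one column; `None` means the bound is unknown.
#[derive(Clone, Copy, Debug)]
struct ColumnRange {
    min: Option<i64>,
    max: Option<i64>,
}

/// Hypothetical join kinds covered by this sketch.
#[derive(Clone, Copy, Debug, PartialEq)]
enum JoinKind {
    LeftSemi,
    LeftAnti,
}

/// Returns true only when the known bounds prove the two ranges cannot overlap.
fn ranges_disjoint(a: &ColumnRange, b: &ColumnRange) -> bool {
    let a_below_b = matches!((a.max, b.min), (Some(x), Some(y)) if x < y);
    let b_below_a = matches!((b.max, a.min), (Some(x), Some(y)) if x < y);
    a_below_b || b_below_a
}

/// Estimates the output row count of a semi- or anti-join when the join keys
/// are provably disjoint. `join_keys` holds (left index, right index) pairs;
/// columns that are not part of a key pair are never compared.
/// Returns `None` when disjointness cannot be proven, leaving the estimate to
/// some other (unspecified) method. Indices are assumed to be in bounds.
fn estimate_semi_or_anti(
    kind: JoinKind,
    left_rows: usize,
    left_cols: &[ColumnRange],
    right_cols: &[ColumnRange],
    join_keys: &[(usize, usize)],
) -> Option<usize> {
    let key_disjoint = join_keys
        .iter()
        .any(|&(l, r)| ranges_disjoint(&left_cols[l], &right_cols[r]));
    if !key_disjoint {
        return None;
    }
    Some(match kind {
        // No left row can find a match, so the semi-join is empty.
        JoinKind::LeftSemi => 0,
        // No left row can find a match, so the anti-join keeps every left row.
        JoinKind::LeftAnti => left_rows,
    })
}
```

In this sketch, a non-key column pair with disjoint ranges has no effect on the result, because only the index pairs listed in `join_keys` are ever inspected.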
