neilconway commented on PR #22652: URL: https://github.com/apache/datafusion/pull/22652#issuecomment-4634578768
@alamb There are two scenarios where this should be a win:

1. Replacing an inner join with a left semi-join, where the inner join would only produce a single matching row for each left tuple. Exactly the same intermediate result sets, but should be slightly faster due to less join overhead etc. I haven't done detailed microbenchmarks yet, but you're right that it seems we aren't seeing major wins from this and it might merit further investigation.
2. Replacing an inner join with a left semi-join, where each left tuple can match multiple right tuples, but then those duplicates are ignored/filtered out by an upstream operator (e.g., `DISTINCT`). In this scenario, this optimization could be a major performance win, depending on the number of duplicates produced of course.

(Symmetric for right semi joins as well, of course.)

I can't see a scenario where it would be a perf regression. There were some semi-join planner/stats bugs that resulted in picking worse plans; https://github.com/apache/datafusion/pull/22762 fixes the last one that I know about. Once that lands I'll rerun the benchmarks for this PR.
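
To make the two scenarios concrete, here is a toy sketch in plain Python. It is not DataFusion code: the helper names (`inner_join_left_cols`, `left_semi_join`, `distinct`) and the `customers`/`orders` data are hypothetical and only model the row counts each join shape produces.

```python
def inner_join_left_cols(left, right, left_key, right_key):
    """Hash inner join that keeps only the left columns.

    Emits one copy of the left row per matching right row, so duplicates
    appear whenever a left row matches several right rows.
    """
    index = {}
    for r in right:
        index.setdefault(r[right_key], []).append(r)
    out = []
    for l in left:
        for _ in index.get(l[left_key], []):
            out.append(l)
    return out


def left_semi_join(left, right, left_key, right_key):
    """Left semi-join: emits each left row at most once if any match exists."""
    keys = {r[right_key] for r in right}
    return [l for l in left if l[left_key] in keys]


def distinct(rows):
    """Order-preserving DISTINCT over hashable rows."""
    return list(dict.fromkeys(rows))


# Hypothetical data: customers(id, name), orders(order_id, customer_id).
customers = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(10, 1), (11, 1), (12, 1), (13, 3)]

# Scenario 1: the right side (customers.id) is unique, so each left row
# (an order) matches at most one right row. Both joins emit identical rows.
s1_inner = inner_join_left_cols(orders, customers, 1, 0)
s1_semi = left_semi_join(orders, customers, 1, 0)

# Scenario 2: a customer can match many orders, and a DISTINCT above the
# join discards the duplicates. The semi-join never produces them.
s2_inner = inner_join_left_cols(customers, orders, 0, 1)
s2_semi = left_semi_join(customers, orders, 0, 1)

print(len(s1_inner), len(s1_semi))  # same intermediate size
print(len(s2_inner), len(s2_semi))  # inner join carries the duplicates
print(distinct(s2_inner) == distinct(s2_semi))
```

In the second scenario the final result is the same either way, but the inner join materializes every duplicate before `DISTINCT` removes it, which is where the larger win would be expected to come from.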
