neilconway commented on PR #22652:
URL: https://github.com/apache/datafusion/pull/22652#issuecomment-4634578768

   @alamb There are two scenarios where this should be a win:
   
   1. Replacing an inner join with a left semi-join, where the inner join would 
produce at most one matching row for each left tuple. The intermediate result 
sets are exactly the same, but the semi-join should be slightly faster due to 
lower join overhead. I haven't done detailed microbenchmarks yet, but you're 
right that we aren't seeing major wins from this, and it might merit further 
investigation.
   2. Replacing an inner join with a left semi-join, where each left tuple can 
match multiple right tuples, but those duplicates are then ignored/filtered out 
by an upstream operator (e.g., `DISTINCT`). In this scenario, the optimization 
could be a major performance win, depending of course on how many duplicates 
the inner join would have produced.
   
   (Symmetric for right semi joins as well, of course.)
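
   To make the two scenarios concrete, here is a toy sketch in plain Python. It is not DataFusion code: the data and the helpers `inner_join` / `left_semi_join` are made up for illustration, and it assumes that only left-side columns are needed above the join.

```python
# Hypothetical toy data; not taken from the PR or its benchmarks.
left = [1, 2, 3]
right_unique = [1, 2]        # scenario 1: at most one match per left row
right_dupes = [1, 1, 1, 2]   # scenario 2: key 1 matches three times


def inner_join(left_rows, right_rows):
    """Emit one (l, r) pair for every matching right row."""
    return [(l, r) for l in left_rows for r in right_rows if l == r]


def left_semi_join(left_rows, right_rows):
    """Emit each left row at most once, if any right row matches."""
    right_keys = set(right_rows)
    return [l for l in left_rows if l in right_keys]


# Scenario 1: same rows either way, so the rewrite only saves join overhead.
inner_1 = [l for l, _ in inner_join(left, right_unique)]
semi_1 = left_semi_join(left, right_unique)

# Scenario 2: the inner join materializes duplicates that DISTINCT then
# throws away; the semi-join never produces them in the first place.
inner_2 = inner_join(left, right_dupes)
distinct_2 = sorted({l for l, _ in inner_2})
semi_2 = left_semi_join(left, right_dupes)
```

   In scenario 2 the inner join hands four rows to the `DISTINCT` step while the semi-join hands over two, and that gap grows with the number of duplicate matches on the right side.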
   
   I can't see a scenario where it would be a perf regression.
   
   There were some semi-join planner/stats bugs that resulted in picking worse 
plans; https://github.com/apache/datafusion/pull/22762 fixes the last one that 
I know about. Once that lands, I'll rerun the benchmarks for this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


