neilconway commented on PR #22652:
URL: https://github.com/apache/datafusion/pull/22652#issuecomment-4650037027

   The latest benchmark numbers look better overall.
   
   Looking at the TPC-H numbers, at least locally w/ SF10, it seems that Q22 
improves because switching to a semi-join results in emitting different stats 
for the join output, and those stats result in avoiding some poor planner 
choices we made for operators above the semi-join. So while semi-joins are a 
bit easier to compute stats for than inner joins, the win there is a bit 
coincidental. Q13 and Q22 are noise (no plan change from this rewrite).
   
   I looked at TPC-DS Q14, it's interesting: locally, the rewrite regresses 
performance by ~15% with >= 2 partitions but improves performance by ~10% with 
a single partition. The reason seems to be that using a semi-join results in 
less flexibility for subsequent partition decision making, so the overall plan 
performs worse because it needs a repartition that the inner join plan didn't.
   
   I think we're okay to merge this as-is, but there's a bunch of room for 
further improvements:
   
   1. #22794 reduced semi-join overhead compared to inner joins, but semi-joins 
are still slower than inner joins in DF. We might still see overall performance 
improvements from this rewrite -- two causes might be plan 
changes/improvements, or perhaps because semi-joins effectively do duplicate 
elimination sooner, which reduces the # of intermediate values passed between 
operators. But if we had a specialized join kernel for semi-joins I think we 
could get significantly better performance on the join itself.
   4. As mentioned for Q14 above, semi-joins actually have one negative 
property compared to inner joins: their partitioning behavior is less flexible. 
Once we commit to LeftSemi vs. RightSemi (which we do on the basis of which 
join operand is smaller), we never consider swapping back -- even if changing 
the join orientation would result in better partitioning behavior for 
downstream operators. That seems like something we could improve, e.g., in 
`EnsureRequirements.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to