neilconway commented on PR #22652: URL: https://github.com/apache/datafusion/pull/22652#issuecomment-4650037027
The latest benchmark numbers look better overall. Looking at the TPC-H numbers, at least locally w/ SF10, it seems that Q22 improves because switching to a semi-join results in emitting different stats for the join output, and those stats result in avoiding some poor planner choices we made for operators above the semi-join. So while semi-joins are a bit easier to compute stats for than inner joins, the win there is a bit coincidental. Q13 and Q22 are noise (no plan change from this rewrite). I looked at TPC-DS Q14, it's interesting: locally, the rewrite regresses performance by ~15% with >= 2 partitions but improves performance by ~10% with a single partition. The reason seems to be that using a semi-join results in less flexibility for subsequent partition decision making, so the overall plan performs worse because it needs a repartition that the inner join plan didn't. I think we're okay to merge this as-is, but there's a bunch of room for further improvements: 1. #22794 reduced semi-join overhead compared to inner joins, but semi-joins are still slower than inner joins in DF. We might still see overall performance improvements from this rewrite -- two causes might be plan changes/improvements, or perhaps because semi-joins effectively do duplicate elimination sooner, which reduces the # of intermediate values passed between operators. But if we had a specialized join kernel for semi-joins I think we could get significantly better performance on the join itself. 4. As mentioned for Q14 above, semi-joins actually have one negative property compared to inner joins: their partitioning behavior is less flexible. Once we commit to LeftSemi vs. RightSemi (which we do on the basis of which join operand is smaller), we never consider swapping back -- even if changing the join orientation would result in better partitioning behavior for downstream operators. That seems like something we could improve, e.g., in `EnsureRequirements. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
