LiaCastaneda opened a new issue, #23237: URL: https://github.com/apache/datafusion/issues/23237
### Describe the bug

While comparing query latencies between DataFusion and Trino on a production workload, I noticed a significant gap for a particular HashJoin pattern. The query is a straightforward inner join with:

- A small build side (~32K rows, ~415 distinct string keys)
- A large probe side (~2.3M rows, heavily skewed: nearly all probe rows carry the same key)
- High fanout: each matched probe row joins to ~78 build rows → ~176M output pairs (avg_fanout ~7800%)
- String join keys (~26 chars)
- Partitioned mode (if the build side has no row statistics, the planner cannot prove it is small and repartitions both sides by key)

Observed latency:

- DataFusion: `join_time` ≈ 6s, query time ≈ 7s
- Trino under the same query conditions (hash-partitioned join, same skew): ≈ 3.4s, query time ≈ 4.5s

### To Reproduce

I left a benchmark repro here: https://github.com/apache/datafusion/pull/23209

- [profile of Q23 main](https://share.firefox.dev/4atAYU9)
- [profile](https://share.firefox.dev/3SzQFTF) on the rough fix attempt (avoiding the per-pair key recheck and the O(matched_pairs) Arrow allocations on collision-free build sides)

<img width="2804" height="450" alt="Image" src="https://github.com/user-attachments/assets/329352e5-4c9e-4ba5-ab37-037a43a72090" />

<img width="2908" height="416" alt="Image" src="https://github.com/user-attachments/assets/3ababbca-940a-49e3-a7d2-a4889f029589" />

The benchmark results with the fix show that Q23 runs 2x faster.

### Expected behavior

_No response_

### Additional context

_No response_
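
To illustrate why a per-pair key recheck is costly at high fanout, here is a minimal toy sketch. It is not DataFusion's implementation and not the code in the linked PR: `ToyBuildSide`, `hash_key`, and `probe` are hypothetical names, and the fast path shown (one key comparison per probe row when no two distinct build keys share a hash) is only an assumption about what skipping the recheck could look like.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Toy build side (hypothetical, for illustration only): maps a key's
/// 64-bit hash to the build row indices that share that hash.
struct ToyBuildSide {
    keys: Vec<String>,
    buckets: HashMap<u64, Vec<usize>>,
    /// True when every bucket holds a single distinct key, i.e. no two
    /// different build keys hash to the same value.
    collision_free: bool,
}

fn hash_key(key: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    hasher.finish()
}

impl ToyBuildSide {
    fn new(keys: Vec<String>) -> Self {
        let mut buckets: HashMap<u64, Vec<usize>> = HashMap::new();
        for (row, key) in keys.iter().enumerate() {
            buckets.entry(hash_key(key)).or_default().push(row);
        }
        // Buckets are never empty, so indexing rows[0] is safe.
        let collision_free = buckets
            .values()
            .all(|rows| rows.iter().all(|&r| keys[r] == keys[rows[0]]));
        Self { keys, buckets, collision_free }
    }

    /// Returns (matched (build_row, probe_row) pairs, key comparisons done).
    fn probe(&self, probe_keys: &[&str]) -> (Vec<(usize, usize)>, usize) {
        let mut pairs = Vec::new();
        let mut comparisons = 0;
        for (p, key) in probe_keys.iter().enumerate() {
            let Some(rows) = self.buckets.get(&hash_key(key)) else {
                continue;
            };
            if self.collision_free {
                // All rows in the bucket share one key, so a single
                // comparison decides the whole bucket for this probe row.
                comparisons += 1;
                if self.keys[rows[0]] == *key {
                    pairs.extend(rows.iter().map(|&b| (b, p)));
                }
            } else {
                // General path: recheck the key for every candidate pair.
                for &b in rows {
                    comparisons += 1;
                    if self.keys[b] == *key {
                        pairs.push((b, p));
                    }
                }
            }
        }
        (pairs, comparisons)
    }
}

fn main() {
    // Three build rows share key "a"; one row has key "b".
    let build = ToyBuildSide::new(vec![
        "a".to_string(),
        "a".to_string(),
        "a".to_string(),
        "b".to_string(),
    ]);
    let (pairs, comparisons) = build.probe(&["a", "a", "c"]);
    println!("pairs={} comparisons={}", pairs.len(), comparisons);
}
```

In this toy model the general path does one string comparison per output pair, while the collision-free path does at most one per probe row. With the workload above (~2.3M probe rows, ~176M output pairs) that difference is roughly the fanout factor of ~78.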
