LiaCastaneda opened a new issue, #23237:
URL: https://github.com/apache/datafusion/issues/23237

   ### Describe the bug
   
   While comparing query latencies between DataFusion and Trino on a prod 
workload, I noticed a significant gap for a particular HashJoin pattern. The 
query is a straightforward inner join with:
   
   - A small build side (~32K rows, ~415 distinct string keys)
   - A large probe side (~2.3M rows, heavily skewed — nearly all probe rows 
carry the same key)
   - High fanout: each matched probe row joins to ~78 build rows → ~176M output 
pairs (avg_fanout ~7800%)
   - String join keys (~26 chars)
   - Partitioned mode (when the build side has no row statistics, the planner 
cannot prove it is small, so it repartitions both sides by key)
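As a rough sanity check on the figures in the list above, the arithmetic can be written out as below. This is only a sketch: it assumes `avg_fanout` means output rows per matched probe row, expressed as a percentage, and the variable names are illustrative rather than DataFusion metrics.

```python
# Approximate figures quoted in the issue.
probe_rows = 2_300_000
build_rows_per_match = 78
reported_output_pairs = 176_000_000

# Upper bound on output size if every probe row found a match.
max_output_pairs = probe_rows * build_rows_per_match

# Number of matched probe rows implied by the reported output size.
implied_matched_probe_rows = reported_output_pairs / build_rows_per_match

# Assumed reading of avg_fanout: output rows per matched probe row, in percent.
avg_fanout_pct = build_rows_per_match * 100

print(max_output_pairs)
print(round(implied_matched_probe_rows))
print(avg_fanout_pct)
```

Under that reading the numbers are consistent with each other: the reported ~176M pairs sit just below the all-rows-match upper bound, so nearly every probe row finds a match.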
   
   Observed latency:
   DataFusion: `join_time` ≈ 6s -- query time ≈ 7s
   Trino under the same query conditions (hash-partitioned join, same skew): ≈ 
3.4s -- query time ≈ 4.5s
   
   
   
   ### To Reproduce
   
   I left a benchmark repro here: https://github.com/apache/datafusion/pull/23209
   
   - [profile of Q23 main](https://share.firefox.dev/4atAYU9)
   - [profile](https://share.firefox.dev/3SzQFTF) on the rough fix attempt 
(avoiding the per-pair key recheck, and its O(matched_pairs) Arrow 
allocations, on collision-free build sides)
   
   <img width="2804" height="450" alt="Image" 
src="https://github.com/user-attachments/assets/329352e5-4c9e-4ba5-ab37-037a43a72090" />
   
   <img width="2908" height="416" alt="Image" 
src="https://github.com/user-attachments/assets/3ababbca-940a-49e3-a7d2-a4889f029589" />
   
   The benchmark results with the fix show Q23 running about 2x faster.
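To illustrate the idea behind the fix attempt, here is a minimal conceptual sketch in plain Python. It is not the DataFusion code or the code in the linked PR. The function names, the `hash_fn` parameter, and the dictionary-based table are all hypothetical. It assumes "collision-free" means every hash bucket on the build side holds exactly one distinct key, so a single key comparison per probe row is enough to accept or reject the whole bucket.

```python
from collections import defaultdict


def build_table(build_keys, hash_fn=hash):
    """Group build row indices by hash and report whether the table is collision-free."""
    buckets = defaultdict(list)
    for idx, key in enumerate(build_keys):
        buckets[hash_fn(key)].append(idx)
    # Collision-free here means: every bucket contains exactly one distinct key.
    collision_free = all(
        len({build_keys[i] for i in rows}) == 1 for rows in buckets.values()
    )
    return buckets, collision_free


def probe(build_keys, probe_keys, buckets, collision_free, hash_fn=hash):
    """Return (build_idx, probe_idx) pairs and the number of key comparisons made."""
    pairs = []
    key_comparisons = 0
    for p_idx, p_key in enumerate(probe_keys):
        rows = buckets.get(hash_fn(p_key), [])
        if not rows:
            continue
        if collision_free:
            # One comparison decides the whole bucket: O(probe_rows) checks.
            key_comparisons += 1
            if build_keys[rows[0]] == p_key:
                pairs.extend((b_idx, p_idx) for b_idx in rows)
        else:
            # General path: recheck the key for every candidate pair.
            for b_idx in rows:
                key_comparisons += 1
                if build_keys[b_idx] == p_key:
                    pairs.append((b_idx, p_idx))
    return pairs, key_comparisons


# Small skewed example: one hot key with high fanout.
build_keys = ["hot"] * 78 + ["cold"]
probe_keys = ["hot"] * 1000
buckets, collision_free = build_table(build_keys)
fast_pairs, fast_checks = probe(build_keys, probe_keys, buckets, collision_free)
slow_pairs, slow_checks = probe(build_keys, probe_keys, buckets, False)
print(len(fast_pairs) == len(slow_pairs), fast_checks, slow_checks)
```

Both paths produce the same pairs. The difference is that the general path does work proportional to the number of matched pairs, while the collision-free path does work proportional to the number of probe rows, which matters most at high fanout.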
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

