nuno-faria opened a new issue, #22098:
URL: https://github.com/apache/datafusion/issues/22098

   Currently, the order of a hash join (i.e., which relation is hashed and 
which is probed) is dictated by the size of the relations in bytes. This is 
controlled by the `should_swap_join_order` function:
   
https://github.com/apache/datafusion/blob/e62f06c914cfdfb388abf0b49a45afe015d76fcb/datafusion/physical-optimizer/src/join_selection.rs#L79-L87
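   For context, the rule reduces to a byte-size comparison. A minimal standalone sketch (the signature and names here are illustrative stand-ins, not DataFusion's actual API, which reads these estimates from plan statistics):

   ```rust
   /// Illustrative stand-in for the current heuristic: return true when
   /// the inputs should be swapped so that the smaller relation (in
   /// bytes) ends up on the build (hashed) side.
   fn should_swap_join_order(left_bytes: usize, right_bytes: usize) -> bool {
       // Swap when the right side is the smaller one, so it gets hashed.
       right_bytes < left_bytes
   }
   ```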
   
   This is a safe heuristic, since it increases the probability of the hash table fitting in memory. However, if one of the tables is considerably wider than the other, we can end up choosing a narrower table that has many more rows as the build side, which can be slower. I ran some tests joining a narrow table (fixed at 100M rows) with a wide one (10M to 100M rows), disabling automatic join reordering with `datafusion.optimizer.join_reordering` so that both the narrow+wide and wide+narrow orders could be tested:
   
   ```sql
   -- narrow (fixed at 100M rows)
   create table t1 (k int, v int);
   insert into t1 select i as k, i as v from generate_series(1, 100000000) t(i);

   -- wide (variable number of rows)
   create table t2 (k int, v varchar);
   insert into t2 select i as k, i || repeat('0', 54) as v from generate_series(1, ...) t(i);
   ```
   
   Here are the results (8 vCPUs, 32GB Mem, avg of 10 runs):
   
   <img width="400" alt="Image" src="https://github.com/user-attachments/assets/9d57880f-7c99-477d-8ff8-1aeb2acf9597" />
   
   The wide table with 10M rows is already slightly larger in bytes than the narrow one with 100M rows, so the current DataFusion implementation always uses the `narrow_join_wide` version. We can see that this version is only faster once the wide table reaches 90M rows, i.e. only 10% fewer rows than the narrow one.
   
   So what I'd like to discuss is whether the hash join order should be decided by a more nuanced heuristic. For example: "if the difference in size between the tables is less than X, go by row count; otherwise go by byte size". It appears that Postgres also does something along these lines:
   
   ```sql
   -- t1 (narrow): 10M rows
   -- t2 (wide): 8M rows <-- hashed
   Hash Cond: (t1.k = t2.k)
   ->  Seq Scan on t1  (... width=8) (actual time=0.016..452.699 rows=10000000.00 loops=1)
   ->  Hash  (...) (actual time=3315.965..3315.966 rows=8000000.00 loops=1)
         Buckets: 8388608  Batches: 1  Memory Usage: 830076kB
         ->  Seq Scan on t2  (... width=65) (actual time=0.027..462.818 rows=8000000.00 loops=1)

   -- t1 (narrow): 10M rows <-- hashed
   -- t2 (wide): 9.5M rows
   Hash Cond: (t2.k = t1.k)
   ->  Seq Scan on t2  (... width=65) (actual time=0.010..437.648 rows=9500000.00 loops=1)
   ->  Hash  (...) (actual time=3013.599..3013.600 rows=10000000.00 loops=1)
         Buckets: 16777216  Batches: 1  Memory Usage: 521697kB
         ->  Seq Scan on t1  (... width=8) (actual time=0.028..472.798 rows=10000000.00 loops=1)
   ```
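   To make the proposed rule concrete, here is a hypothetical sketch of such a hybrid heuristic. The function name, signature, and threshold are all illustrative (a real implementation would consult DataFusion's plan statistics), and the concrete value of "X" would need benchmarking:

   ```rust
   /// Hypothetical hybrid rule: when the two relations are within a
   /// factor `size_ratio_threshold` (the "X" above) of each other in
   /// bytes, hash the side with fewer rows; otherwise fall back to the
   /// existing byte-size comparison. Returns true when the inputs
   /// should be swapped so the right side becomes the build side.
   fn should_swap_hybrid(
       left_rows: usize,
       left_bytes: usize,
       right_rows: usize,
       right_bytes: usize,
       size_ratio_threshold: f64,
   ) -> bool {
       let (small, large) = if left_bytes < right_bytes {
           (left_bytes, right_bytes)
       } else {
           (right_bytes, left_bytes)
       };
       let comparable = small > 0 && (large as f64 / small as f64) <= size_ratio_threshold;
       if comparable {
           // Byte sizes are close: prefer the relation with fewer rows.
           right_rows < left_rows
       } else {
           // Byte sizes differ a lot: keep the safe byte-size rule.
           right_bytes < left_bytes
       }
   }
   ```

   With a threshold of 4.0, the benchmark above would pick the 10M-row wide table as the build side even though it is slightly larger in bytes, while still hashing the narrow table once the size gap grows past the threshold.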
   
   For reference, here are the sizes of the tables. Using the narrow table as the build side only became faster once the wide one was 9x larger in bytes.
   <img width="400" alt="Image" src="https://github.com/user-attachments/assets/bd136af9-acec-44be-aecd-3dfb6b753e66" />
   
   cc: @alamb @Dandandan @adriangb since I think you are also interested in 
hash join performance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

