jonahgao commented on issue #9394:
URL: 
https://github.com/apache/arrow-datafusion/issues/9394#issuecomment-1970333727

   I think this might be caused by join reorder, execute the following two 
queries in DataFusion
   ```sh
   DataFusion CLI v36.0.0
   ❯ create table a1 as values
     ([1,2,3], 1),
     ([4,5,6], 2);
   
   ❯ create table a2 as values 
     (1),  
     (2);
   
   ❯ create table b as values (10), (20), (30);
   
   ❯ select * from a1,b;
   +-----------+---------+---------+
   | column1   | column2 | column1 |
   +-----------+---------+---------+
   | [1, 2, 3] | 1       | 10      |
   | [4, 5, 6] | 2       | 10      |
   | [1, 2, 3] | 1       | 20      |
   | [4, 5, 6] | 2       | 20      |
   | [1, 2, 3] | 1       | 30      |
   | [4, 5, 6] | 2       | 30      |
   +-----------+---------+---------+
   6 rows in set. Query took 0.006 seconds.
   
   ❯ select * from a2,b;
   +---------+---------+
   | column1 | column1 |
   +---------+---------+
   | 1       | 10      |
   | 1       | 20      |
   | 1       | 30      |
   | 2       | 10      |
   | 2       | 20      |
   | 2       | 30      |
   +---------+---------+
   6 rows in set. Query took 0.006 seconds.
   
   ```
   The result of query `select * from a2,b` is the same to DuckDB, but query 
`select * from a1,b` is different.
   This is because query `select * from a1,b` has been reordered, and we can 
confirm it with `explain verbose`.
   Differentiate the left and right tables by the number of rows in statistics.
   
   This reordering behavior is triggered by the bytes size of tables.
   
https://github.com/apache/arrow-datafusion/blob/ca37ce37933f7874d404364cb8c23438baceb46d/datafusion/core/src/physical_optimizer/join_selection.rs#L73-L77
   
   Because we need to load the left side into memory, this behavior should be 
reasonable.
   
   Additionally, I've also noticed that there is a TODO here, which might be 
helpful.
   > // TODO: In PrestoSQL, the optimizer flips join sides only if one side is 
much smaller than the other by more than SIZE_DIFFERENCE_THRESHOLD times, by 
default is is 8 times.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to