korowa commented on issue #9928:
URL: 
https://github.com/apache/arrow-datafusion/issues/9928#issuecomment-2054147818

   According to input plan, and how the join is planned
   ```
   HashJoinExec: mode=>>>CollectLeft<<<, join_type=Full
   ```
   target partitions value is 1 (single core, works as expected, ok), since 
CollectLeft planned by default for single partition, otherwise it would be 
Auto/Partitioned.
   
   In the same time, ParquetExec is planned as if there are 146 partitions
   ```
   ParquetExec: file_groups={146 groups: [[...]]}
   ```
   normally datafusion plans 1 file group per target partition -- so this plan 
is already a bit inconsistent, as ParquetExec planned through delta-rs doesn't 
seem to respect partition limitations (if I'm not mistaken).
   
   After initial planning there is optimization phase -- currently DF has some 
issues with FULL (and some other) joins in CollectLeft mode, so 
`join_selection` optimizer rule converts CollectLeft to Partitioned, and after 
that `enforce_distribution` normally adds RepartitionExec operators, but it 
doesn't happen for single partition (since it's already satisfied, and 
repartitioning N -> 1 won't increase execution parallelism) --  there are no 
repartitions in result plan, and a a result we have Partitioned HashJoinExec 
with 1 and 146 partitions input, which is an incorrect operator configuration.
   
   In the same time increasing number of target partitions allows 
`enforce_distribution` to add RepartitionExec-s, (since it's not satisfied) and 
result plan is fine.
   
   To sum up:
   1) Looks like ParquetExec planning is not working as expected in delta-rs -- 
according to [this 
function](https://github.com/delta-io/delta-rs/blob/d49d95ba4bf576254122b8491c4d73c5f65b16b0/crates/core/src/delta_datafusion/mod.rs#L506)
 DeltaScan builder creates ParquetExec with file_groups, grouped by 
partition_column values, and doesn't take into account `target_partitions` 
config.
   2) #9757 should help with CollectLeft FULL / LEFT OUTER join execution (not 
the root cause of the issue though)
   3) Maybe needs some plan validation to ensure that no operators violate 
restrictions originated from configuration :thinking: ? 
   
   Any other ideas and comment are welcome.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to