Re: [I] Improve internal worker parallelism support [datafusion]

via GitHub Thu, 25 Jun 2026 03:58:54 -0700


alamb commented on issue #23174:
URL: https://github.com/apache/datafusion/issues/23174#issuecomment-4798510479


   I think the property that the partitioning / parallelism is explicit in the 
plan is a core one in DataFusion
   
   Another way we could consider modeling the usecases above (e.g. a single 
partition in a WindowFunction) is to keep the multiple partitions, but 
internally use shared state, similar to how `RepartitionExec` or `JoinExec`, 
and `FileInputStream`. That way we keep the parallelism tied to the plan's 
structure (partition_count) but the streams executing each partition can 
dynamically adapt during plan time to better use resources
   
   Concretely, maybe instead of
   
   ```
   (any downstream exec)
   -- RepartitionExec(round-robin on batch, input_partitions=1, 
otuput_partitions=32)
   ---- WindowExec(partition=1, internal_parallelism=32)
   ------ CoalescePartitionExec(input_partitions=32, output_partitions=1)
   -------- CsvExec(partition=32, internal_parallelism=1)
   ```
   
   We had something like
   
   ```
   (any downstream exec)
   ---- WindowExec(partition=32)
   -------- CsvExec(partition=32)
   ```
   
   And then **internally** within `WindowExec(partition=32, 
internal_parallelism=32)` it knew enough to coalesce the data into a single 
partition, and then split the work across the multiple partition streams (with 
a segment tree or whatever)
   
   I don't think this is fundamentally different than a single exec with 
internal paralleism, but I think it keeps task/cpi model the same


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [I] Improve internal worker parallelism support [datafusion]

Reply via email to