alamb commented on issue #23174: URL: https://github.com/apache/datafusion/issues/23174#issuecomment-4798510479
I think the property that the partitioning / parallelism is explicit in the plan is a core one in DataFusion Another way we could consider modeling the usecases above (e.g. a single partition in a WindowFunction) is to keep the multiple partitions, but internally use shared state, similar to how `RepartitionExec` or `JoinExec`, and `FileInputStream`. That way we keep the parallelism tied to the plan's structure (partition_count) but the streams executing each partition can dynamically adapt during plan time to better use resources Concretely, maybe instead of ``` (any downstream exec) -- RepartitionExec(round-robin on batch, input_partitions=1, otuput_partitions=32) ---- WindowExec(partition=1, internal_parallelism=32) ------ CoalescePartitionExec(input_partitions=32, output_partitions=1) -------- CsvExec(partition=32, internal_parallelism=1) ``` We had something like ``` (any downstream exec) ---- WindowExec(partition=32) -------- CsvExec(partition=32) ``` And then **internally** within `WindowExec(partition=32, internal_parallelism=32)` it knew enough to coalesce the data into a single partition, and then split the work across the multiple partition streams (with a segment tree or whatever) I don't think this is fundamentally different than a single exec with internal paralleism, but I think it keeps task/cpi model the same -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
