2010YOUY01 commented on issue #23174: URL: https://github.com/apache/datafusion/issues/23174#issuecomment-4806026533
> I think the property that the partitioning / parallelism is explicit in the plan is a core one in DataFusion
>
> Another way we could consider modeling the use cases above (e.g. a single partition in a WindowFunction) is to keep the multiple partitions, but internally use shared state, similar to how `RepartitionExec` or `JoinExec`, and `FileInputStream`. That way we keep the parallelism tied to the plan's structure (partition_count) but the streams executing each partition can dynamically adapt during plan time to better use resources
>
> Concretely, maybe instead of
>
> ```
> (any downstream exec)
> -- RepartitionExec(round-robin on batch, input_partitions=1, output_partitions=32)
> ---- WindowExec(partition=1, internal_parallelism=32)
> ------ CoalescePartitionExec(input_partitions=32, output_partitions=1)
> -------- CsvExec(partition=32, internal_parallelism=1)
> ```
>
> We had something like
>
> ```
> (any downstream exec)
> ---- WindowExec(partition=32)
> -------- CsvExec(partition=32)
> ```
>
> And then **internally** within `WindowExec(partition=32, internal_parallelism=32)` it knew enough to coalesce the data into a single partition, and then split the work across the multiple partition streams (with a segment tree or whatever)
>
> I don't think this is fundamentally different than a single exec with internal parallelism, but I think it keeps task/cpi model the same

I think this simplified model is a good idea in single-node cases. My concern is that, for distributed use cases, shared-state parallelism usually isn't easily applicable, so we may want to explicitly separate partition parallelism from internal parallelism to make the adaptation easier.

cc @gabotechs who has been working on `datafusion-distributed`
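
For illustration only, here is a minimal std-only Rust sketch of the shared-state idea in the quoted proposal: the operator keeps one output partition per input partition, coalesces the rows into shared state, and lets each partition stream claim chunks of work from it. `SharedState`, `window_sum`, and `execute_shared` are hypothetical names, threads stand in for partition streams, and a sliding sum stands in for a real window function; none of this is DataFusion API.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical shared state behind one operator; not a DataFusion type.
struct SharedState {
    /// All input partitions coalesced into one ordered run of rows.
    rows: Vec<i64>,
    /// Cursor over chunks of `rows`, shared by every partition stream.
    next_chunk: AtomicUsize,
}

/// Sum of the current row and up to `width - 1` preceding rows.
fn window_sum(rows: &[i64], i: usize, width: usize) -> i64 {
    let start = (i + 1).saturating_sub(width);
    rows[start..=i].iter().sum()
}

/// Toy "window operator" with as many output partitions as input partitions.
/// Returns, per output partition, the (row index, window value) pairs it produced.
fn execute_shared(inputs: Vec<Vec<i64>>, width: usize, chunk: usize) -> Vec<Vec<(usize, i64)>> {
    assert!(chunk > 0, "chunk size must be positive");
    let partitions = inputs.len();
    let state = Arc::new(SharedState {
        rows: inputs.into_iter().flatten().collect(),
        next_chunk: AtomicUsize::new(0),
    });

    // One thread per output partition stands in for one partition stream.
    let handles: Vec<_> = (0..partitions)
        .map(|_| {
            let state = Arc::clone(&state);
            thread::spawn(move || {
                let mut out = Vec::new();
                loop {
                    // Claim the next unprocessed chunk from the shared cursor.
                    let start = state.next_chunk.fetch_add(1, Ordering::Relaxed) * chunk;
                    if start >= state.rows.len() {
                        break;
                    }
                    let end = (start + chunk).min(state.rows.len());
                    for i in start..end {
                        out.push((i, window_sum(&state.rows, i, width)));
                    }
                }
                out
            })
        })
        .collect();

    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Calling `execute_shared(vec![vec![1, 2, 3], vec![4, 5]], 3, 2)` returns two output partitions whose contents depend on scheduling, but whose union covers every row exactly once. The `Arc` only works inside one process, which is the difficulty for distributed execution raised above.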
