tustvold edited a comment on issue #2079: URL: https://github.com/apache/arrow-datafusion/issues/2079#issuecomment-1083214449
> How would we parallelize the physical operators after this TableScanOp? You would have multiple single-partition `ParquetExec` being fed into a single `UnionExec`, potentially with some `ProjectionExec`, etc... in between. As each Datafusion partition gets its own tokio task => parallelism. If you wanted parallelism within a single file, you would have an optimizer pass that would replace the single `ParquetExec` with multiple with disjoint row groups, again this would be fed into a `UnionExec`. > What do you think if we avoid async and stream from normal operators' execute()? Let me get back to you on this, it is something I am currently mulling about and experimenting with. I agree that using async for CPU-bound work seems a little wonky, but as @alamb articulated [here](https://thenewstack.io/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/) there are reasons that it may be the pragmatic choice. I'm trying to collect some data so we can make an informed decision :sweat_smile: _FWIW as you link to the morsel driven paper - what you describe I think is closer to the more traditional plan-driven parallelism than morsel-driven parallelism. Tokio is much closer to that paper than what you describe as it incorporates notions of dynamic scheduling and work-stealing, rayon may be even closer_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
