yjshen edited a comment on issue #2079: URL: https://github.com/apache/arrow-datafusion/issues/2079#issuecomment-1083205518
> Translation of a LogicalPlan::TableScan into a corresponding ExecutionPlan Sounds great. We could eliminate much of the common logic scattered in different file formats, as you mentioned. And yes, this makes sense! > removing the ability to scan multiple files within a single operator, and instead composing multiple per-file operators in the generated ExecutionPlan. I'm confused here: How would we parallelize the physical operators after this `TableScanOp`? And how do we control this `single` operators parallelism? Will the data batch be sent through a multi-producer(the single ops)-multi-consumer(like filter/project/sort) queue? > compose the SendableRecordBatchStream together. I think making this more explicit sounds like a good idea, but I'm not sure it is a simple case of grouping operators together, and would likely involve altering the contract of ExecutionPlan to be more explicit about what operators spawn additional parallel tasks. Perhaps this is what you're proposing? What do you think if we avoid `async` and `stream` from normal operators' `execute()`? just like you and @alamb mentioned merging project/filter logic into scan operator, these pure computations are async free in my opinion, they are data or computation-intensive and do not talk with IO systems. What do you think if we have (similar to that of [Morsel-driven](https://15721.courses.cs.cmu.edu/spring2016/papers/p743-leis.pdf)): <img width="847" alt="image" src="https://user-images.githubusercontent.com/1387718/160857783-cd666871-8063-433a-92b3-98cc097ca117.png"> -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
