[GitHub] [arrow-datafusion] yjshen edited a comment on issue #2079: RFC: More Granular File Operators

GitBox Wed, 30 Mar 2022 07:27:00 -0700


yjshen edited a comment on issue #2079:
URL: 
https://github.com/apache/arrow-datafusion/issues/2079#issuecomment-1083205518

> Translation of a LogicalPlan::TableScan into a corresponding ExecutionPlan

Sounds great. We could eliminate much of the common logic scattered in
different file formats, as you mentioned. And yes, this makes sense!

> removing the ability to scan multiple files within a single operator, and
instead composing multiple per-file operators in the generated ExecutionPlan.

I'm confused here: how would we parallelize the physical operators that come
after this `TableScanOp`? And how do we control the parallelism of this
`single` operator? Will the data batches be sent through a multi-producer (the
single ops), multi-consumer (e.g. filter/project/sort) queue?
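
To make the question concrete, here is a minimal sketch of the kind of
multi-producer, multi-consumer queue I have in mind. It is not DataFusion
code: `Batch`, `scan_and_filter`, and the "keep even values" filter are
hypothetical stand-ins, and it uses only std threads and channels rather than
the `ExecutionPlan` API.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for a record batch (a real engine would use Arrow's
/// `RecordBatch`).
type Batch = Vec<i64>;

/// Hypothetical pipeline: one producer thread per "file" and `num_consumers`
/// worker threads that apply a filter. Returns the sum of all even values.
fn scan_and_filter(files: Vec<Vec<Batch>>, num_consumers: usize) -> i64 {
    let (tx, rx) = mpsc::channel::<Batch>();
    // std's channel has a single receiver, so the consumers share it
    // behind a mutex.
    let rx = Arc::new(Mutex::new(rx));

    // One producer per file: each pushes its batches into the shared queue.
    let producers: Vec<_> = files
        .into_iter()
        .map(|batches| {
            let tx = tx.clone();
            thread::spawn(move || {
                for batch in batches {
                    tx.send(batch).expect("receiver dropped");
                }
            })
        })
        .collect();
    // Drop the original sender so the channel closes once all producers finish.
    drop(tx);

    // Consumers pull whichever batch is available next, regardless of which
    // file produced it.
    let consumers: Vec<_> = (0..num_consumers)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || {
                let mut partial = 0i64;
                loop {
                    // The lock is held only for the duration of `recv`.
                    let next = rx.lock().unwrap().recv();
                    match next {
                        Ok(batch) => {
                            partial += batch.iter().filter(|v| **v % 2 == 0).sum::<i64>();
                        }
                        // All senders are gone and the queue is drained.
                        Err(_) => break,
                    }
                }
                partial
            })
        })
        .collect();

    for p in producers {
        p.join().unwrap();
    }
    consumers.into_iter().map(|c| c.join().unwrap()).sum()
}
```

Calling `scan_and_filter(files, n)` with one inner `Vec<Batch>` per file
decouples the number of files from the number of downstream workers. The
channel here is unbounded; `mpsc::sync_channel` would be one way to add
backpressure.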

> compose the SendableRecordBatchStream together. I think making this more
explicit sounds like a good idea, but I'm not sure it is a simple case of
grouping operators together, and would likely involve altering the contract of
ExecutionPlan to be more explicit about what operators spawn additional
parallel tasks. Perhaps this is what you're proposing?

What do you think about removing `async` and `stream` from normal operators'
`execute()`? As you and @alamb mentioned when discussing merging the
project/filter logic into the scan operator, these pure computations are
async-free in my opinion: they are data- or computation-intensive and do not
talk to IO systems. What do you think if we have (similar to that of
[Morsel-driven](https://15721.courses.cs.cmu.edu/spring2016/papers/p743-leis.pdf)):

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

