tustvold edited a comment on issue #2079: URL: https://github.com/apache/arrow-datafusion/issues/2079#issuecomment-1079001425
I agree entirely that the current scheduling within DataFusion, which effectively punts onto tokio and hopes it does something vaguely appropriate, is likely sub-optimal. In fact one of the issues I'm having with #1617 appears to be that sub-optimal task scheduling is causing a 2x performance regression. That being said, I think there are three problems here and it would be advantageous in my mind to keep them separate: * Translation of a `LogicalPlan::TableScan` into a corresponding `ExecutionPlan` * Optimisation of this `ExecutionPlan` to potentially introduce additional parallelism * Scheduling of this `ExecutionPlan` to actually perform its computation This ticket is then concerned solely with the first of these, and reducing the complexity of the file-format specific operators necessary to achieve this. I think the remaining two problems can and should be kept separate, in part because they have broader scope than just scanning files. _FWIW we sort of already do the stage-based execution you describe, but it is implicit based on whether operators call `tokio::spawn` or just compose the `SendableRecordBatchStream` together. I think making this more explicit sounds like a good idea, but I'm not sure it is a simple case of grouping operators together, and would likely involve altering the contract of `ExecutionPlan` to be more explicit about what operators spawn additional parallel tasks. Perhaps this is what you're proposing?_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
