tustvold edited a comment on issue #2079:
URL: 
https://github.com/apache/arrow-datafusion/issues/2079#issuecomment-1079001425


   I agree entirely that the current scheduling within DataFusion, which 
effectively punts onto tokio and hopes it does something vaguely appropriate, 
is likely sub-optimal. In fact one of the issues I'm having with #1617 appears 
to be that sub-optimal task scheduling is causing a 2x performance regression.
   
   That being said, I think there are three problems here and it would be 
advantageous in my mind to keep them separate:
   
   * Translation of a `LogicalPlan::TableScan` into a corresponding 
`ExecutionPlan`
   * Optimisation of this `ExecutionPlan` to potentially introduce additional 
parallelism
   * Scheduling of this `ExecutionPlan` to actually perform its computation
   
   This ticket is then concerned solely with the first of these, and reducing 
the complexity of the file-format specific operators necessary to achieve this. 
I think the remaining two problems can and should be kept separate, in part 
because they have broader scope than just scanning files.
   
   _FWIW we sort of already do the stage-based execution you describe, but it 
is implicit based on whether operators call `tokio::spawn` or just compose the 
`SendableRecordBatchStream` together. I think making this more explicit sounds 
like a good idea, but I'm not sure it is a simple case of grouping operators 
together, and would likely involve altering the contract of `ExecutionPlan` to 
be more explicit about what operators spawn additional parallel tasks. Perhaps 
this is what you're proposing?_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to