thinkharderdev opened a new issue, #11451:
URL: https://github.com/apache/datafusion/issues/11451

   ### Is your feature request related to a problem or challenge?
   
   Certain operators (e.g. `AggregateExec`, `SortExec`) are "greedy" in that they 
will continue processing batches as long as their input produces them, 
without yielding. This can effectively create a hot loop that 
monopolizes worker threads and starves other tasks in the runtime. 
   
   ### Describe the solution you'd like
   
   Add an optional `coop_budget` to the `ExecutionOptions`. When set, greedy 
operators would wrap their base record batch stream in a stream which ensures 
it yields back to the scheduler after every `coop_budget` record batches. This 
would only kick in after `coop_budget` batches are processed without yielding. 
If the underlying stream yields, the coop budget gets reset. 
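
   To make the proposed behavior concrete, here is a minimal, std-only sketch of such a wrapper. It is not DataFusion code: `PollSource` is a hypothetical stand-in for `futures::Stream::poll_next`, and `CoopStream`, `ReadySource`, and `drain_counting_yields` are illustrative names. The forced yield is assumed to be implemented by waking the task and returning `Poll::Pending`.

```rust
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical stand-in for `futures::Stream::poll_next`, so the sketch
/// compiles with the standard library only.
trait PollSource {
    type Item;
    fn poll_next(&mut self, cx: &mut Context<'_>) -> Poll<Option<Self::Item>>;
}

/// Wrapper that forces a yield after `budget` consecutive ready items.
struct CoopStream<S> {
    inner: S,
    budget: usize,
    consumed: usize,
}

impl<S> CoopStream<S> {
    fn new(inner: S, budget: usize) -> Self {
        // Assumption: a budget of 0 is clamped to 1 so the wrapper always makes progress.
        Self { inner, budget: budget.max(1), consumed: 0 }
    }
}

impl<S: PollSource> PollSource for CoopStream<S> {
    type Item = S::Item;

    fn poll_next(&mut self, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        if self.consumed >= self.budget {
            // Budget exhausted: reset it, ask to be polled again, and yield.
            self.consumed = 0;
            cx.waker().wake_by_ref();
            return Poll::Pending;
        }
        match self.inner.poll_next(cx) {
            Poll::Ready(Some(item)) => {
                self.consumed += 1;
                Poll::Ready(Some(item))
            }
            Poll::Ready(None) => Poll::Ready(None),
            Poll::Pending => {
                // The underlying stream yielded on its own, so the budget resets.
                self.consumed = 0;
                Poll::Pending
            }
        }
    }
}

/// Toy source that is always ready, like an input that never awaits IO.
struct ReadySource(std::vec::IntoIter<u32>);

impl PollSource for ReadySource {
    type Item = u32;
    fn poll_next(&mut self, _cx: &mut Context<'_>) -> Poll<Option<u32>> {
        Poll::Ready(self.0.next())
    }
}

/// Waker that does nothing; enough for the busy-polling driver below.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Toy driver: polls to completion and counts how often the stream yielded.
fn drain_counting_yields<S: PollSource>(stream: &mut S) -> (Vec<S::Item>, usize) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut items = Vec::new();
    let mut yields = 0;
    loop {
        match stream.poll_next(&mut cx) {
            Poll::Ready(Some(item)) => items.push(item),
            Poll::Ready(None) => return (items, yields),
            Poll::Pending => yields += 1,
        }
    }
}
```

   With a budget of 2 and an always-ready source of five items, `drain_counting_yields(&mut CoopStream::new(ReadySource(vec![1, 2, 3, 4, 5].into_iter()), 2))` returns all five items and observes two forced yields. A real implementation would wrap a record batch stream and run on the async runtime rather than a busy-polling loop.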
   
   ### Describe alternatives you've considered
   
   This can be done outside of DataFusion by inserting the cooperative stream 
at various places, but it would be nice if this were built into the engine.
   
   ### Additional context
   
   I think this is probably not a big issue if you set the partition 
parallelism to the number of CPU cores, since the IO is fairly well pipelined 
inside `ParquetExec` and other operators that do IO. However, we have found 
that in network-IO-heavy workloads (e.g. reading from object storage), scheduling 
one partition per core leaves the executors underutilized in most cases. 
   
   The goal of this feature is to be able to oversubscribe the cores to 
take advantage of IO parallelism while avoiding horrendous tail 
latencies in particularly CPU-intensive queries. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org