alamb opened a new issue, #19770:
URL: https://github.com/apache/datafusion/issues/19770

   ### Is your feature request related to a problem or challenge?
   
   At @akurmustafa's great suggestion on 
https://github.com/apache/datafusion/discussions/18260, I submitted a talk and 
was accepted to speak at the first TokioConf about using Tokio as the 
DataFusion CPU runtime engine.
   
   Here is more detail
   https://www.tokioconf.com/speakers
   
   Here is the talk summary
   
   > ### Using Tokio for CPU-Bound Tasks (Works Really Well)
   The Tokio runtime at the heart of the Rust async ecosystem is also a good 
choice for CPU-heavy jobs such as those found in analytics engines. We will 
review what makes Tokio a compelling choice for CPU-bound workloads, address 
common concerns, and report on our experience using Tokio as the thread 
scheduler for Apache DataFusion.
   
   ### Describe the solution you'd like
   
   I want to create this talk / slides in the open. If the talks aren't 
recorded, I will also record a second version of the talk
   
   ### Describe alternatives you've considered
   
   The high-level idea is to summarize the findings in 
   
https://www.influxdata.com/blog/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/
   and then refresh the major pitfalls.
   
   
   Talk outline:
   1. Analytic DB 101 + Volcano Model: Explain the DataFusion execution model 
(data flow graphs and vectorized execution)
   2. Explain why people thought using Tokio for CPU-bound work was bad, and the 
counterarguments
   3. Demonstrate how Tokio's scheduler effectively implements the 
"get_next_batch()" API on the same thread
   4. Discuss pitfalls
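
   To make the "get_next_batch()" idea in the outline concrete, here is a 
minimal, dependency-free sketch of a pull-based (Volcano-style) plan that 
processes one batch per call. The `Operator` trait, `Scan`, `FilterEven`, and 
`run_plan` are hypothetical names for illustration only; they are not 
DataFusion APIs, and DataFusion itself works on streams of Arrow record 
batches rather than `Vec<i64>`.

```rust
/// A batch of values; real engines use columnar Arrow arrays instead.
type Batch = Vec<i64>;

/// Hypothetical pull-based operator interface (not a DataFusion API).
trait Operator {
    /// Returns the next batch, or `None` once the input is exhausted.
    fn get_next_batch(&mut self) -> Option<Batch>;
}

/// Leaf operator that produces the numbers `next..end` in fixed-size batches.
struct Scan {
    next: i64,
    end: i64,
    batch_size: usize,
}

impl Operator for Scan {
    fn get_next_batch(&mut self) -> Option<Batch> {
        if self.next >= self.end {
            return None;
        }
        let stop = (self.next + self.batch_size as i64).min(self.end);
        let batch: Batch = (self.next..stop).collect();
        self.next = stop;
        Some(batch)
    }
}

/// Keeps only even values, handling a whole batch per call (vectorized).
struct FilterEven<I: Operator> {
    input: I,
}

impl<I: Operator> Operator for FilterEven<I> {
    fn get_next_batch(&mut self) -> Option<Batch> {
        // Pull one batch from the child, then filter it in a single pass.
        let batch = self.input.get_next_batch()?;
        Some(batch.into_iter().filter(|v| v % 2 == 0).collect())
    }
}

/// Drains the plan by repeatedly pulling from the root operator.
fn run_plan(mut root: impl Operator) -> Vec<i64> {
    let mut out = Vec::new();
    while let Some(batch) = root.get_next_batch() {
        out.extend(batch);
    }
    out
}

fn main() {
    let plan = FilterEven { input: Scan { next: 0, end: 10, batch_size: 4 } };
    println!("{:?}", run_plan(plan));
}
```

   In an async engine the same pull happens when a parent stream polls its 
child, which is why the work for one batch tends to run on the same thread.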
   
   
   Major pitfall 1: Using the same async runtime for IO and CPU bound tasks
   * Explain symptoms (everything just slows down under high concurrency) -- the 
theory is that this is due to the network congestion control protocol
   * Explain solution: use separate runtimes for IO and CPU work, and thread 
them through the code
   * TODO: find DF example of multiple runtimes
   * TODO: mention the challenge of having to pass a new runtime to different 
IO libraries (object_store, etc)
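
   As a rough, dependency-free analogy for the "separate runtimes" solution, 
the sketch below uses two plain thread pools, one standing in for the IO 
runtime and one for the CPU runtime. `Pool`, `sum_on`, and the thread names 
are hypothetical; with Tokio each pool would instead be its own 
`tokio::runtime::Runtime`, and this is not how DataFusion wires things up.

```rust
use std::sync::mpsc::{channel, Sender};
use std::sync::{Arc, Mutex};
use std::thread;

type Job = Box<dyn FnOnce() + Send + 'static>;

/// Hypothetical stand-in for a runtime: a named pool of worker threads.
struct Pool {
    sender: Sender<Job>,
}

impl Pool {
    fn new(name: &str, threads: usize) -> Self {
        let (sender, receiver) = channel::<Job>();
        let receiver = Arc::new(Mutex::new(receiver));
        for i in 0..threads {
            let receiver = Arc::clone(&receiver);
            thread::Builder::new()
                .name(format!("{name}-{i}"))
                .spawn(move || loop {
                    // Hold the lock only while waiting for a job, not while running it.
                    let job = receiver.lock().unwrap().recv();
                    match job {
                        Ok(job) => job(),
                        Err(_) => break, // pool dropped: no more jobs
                    }
                })
                .expect("failed to spawn worker thread");
        }
        Pool { sender }
    }

    /// Runs `f` on this pool and blocks the caller until the result is ready.
    fn run<T: Send + 'static>(&self, f: impl FnOnce() -> T + Send + 'static) -> T {
        let (tx, rx) = channel();
        self.sender
            .send(Box::new(move || {
                let _ = tx.send(f());
            }))
            .expect("pool has shut down");
        rx.recv().expect("worker dropped the result")
    }
}

/// Runs a CPU-heavy sum on `pool`; returns the sum and the worker thread's name.
fn sum_on(pool: &Pool, n: u64) -> (u64, String) {
    pool.run(move || {
        let total: u64 = (0..n).sum();
        let name = thread::current().name().unwrap_or("unnamed").to_string();
        (total, name)
    })
}

fn main() {
    // Separate pools, so heavy CPU work cannot starve the IO threads.
    let io_pool = Pool::new("io", 2);
    let cpu_pool = Pool::new("cpu", 2);

    let (total, worker) = sum_on(&cpu_pool, 1_000);
    println!("sum={total} ran on {worker}");

    let io_worker = io_pool.run(|| thread::current().name().unwrap_or("unnamed").to_string());
    println!("io task ran on {io_worker}");
}
```

   The point of the split is that a long computation only ever occupies `cpu` 
threads, so work queued on the `io` pool keeps being serviced promptly.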
   
   Major pitfall 2: Hot loops and cancelling
   * Basically summarize the contents of 
https://datafusion.apache.org/blog/2025/06/30/cancellation/ from @pepijnve 
   * Explain symptoms: the query is cancelled, but the plan keeps running
   * Solution 1: (obvious one) no hot loops
   * Solution 2: (less obvious) make sure tasks periodically yield back to 
the scheduler (otherwise tasks keep running and the scheduler never gets a 
chance to notice that the consumers have been dropped)
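
   The yielding idea in Solution 2 can be sketched with the standard library 
alone. Below, a CPU-bound future returns `Poll::Pending` after a fixed budget 
of steps, and a toy executor can then cancel it by dropping it between polls. 
`CountingTask`, `run_with_cancel`, and the budget values are hypothetical and 
only for illustration; in Tokio code the yield point would typically be 
something like `tokio::task::yield_now().await`.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Waker that does nothing; the toy executor below simply polls in a loop.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// CPU-bound "operator" that counts to `limit`, yielding every `budget` steps.
struct CountingTask {
    done: u64,
    limit: u64,
    budget: u64,
}

impl Future for CountingTask {
    type Output = u64;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u64> {
        for _ in 0..self.budget {
            if self.done == self.limit {
                return Poll::Ready(self.done);
            }
            self.done += 1; // stand-in for processing one batch
        }
        // Budget used up: ask to be polled again and hand control back.
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

/// Toy executor: polls `task` until it finishes or `max_polls` is reached,
/// at which point it "cancels" the task by dropping it.
/// Returns the result (if any) and the number of polls performed.
fn run_with_cancel(mut task: CountingTask, max_polls: u32) -> (Option<u64>, u32) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    for polls in 1..=max_polls {
        if let Poll::Ready(v) = Pin::new(&mut task).poll(&mut cx) {
            return (Some(v), polls);
        }
    }
    (None, max_polls) // task is dropped here, so it stops running
}

fn main() {
    // Yields every 100 steps: the executor regains control and can cancel.
    let yielding = CountingTask { done: 0, limit: 1000, budget: 100 };
    println!("yielding task: {:?}", run_with_cancel(yielding, 3));

    // Never yields (hot loop): it runs to completion inside the first poll,
    // so the executor has no opportunity to cancel it.
    let hot = CountingTask { done: 0, limit: 1000, budget: u64::MAX };
    println!("hot-loop task: {:?}", run_with_cancel(hot, 3));
}
```

   The only difference between the two runs is the budget: without a yield 
point the executor never gets back in the loop until the work is finished.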
   
   ### Additional context
   
   _No response_

