Re: [PR] Dynamic work scheduling in FileStream [datafusion]

via GitHub Sat, 11 Apr 2026 03:15:56 -0700


Dandandan commented on PR #21351:
URL: https://github.com/apache/datafusion/pull/21351#issuecomment-4229252219


   Really cool! I'll try to allocate some time to this / the base PR.
   
   Let's also collect some follow-up work if we haven't yet!
   I think the latest PRs allow us to do things a bit differently and get the 
most out of it!
   
   Here are some off the top of my head:
   
   1. Morsel splitting (more parallelism at the tail / small queries) / merging 
(small batch decoding/processing overhead)
   2. Prefetching IO / combining small IO requests (reducing `spawn_blocking` / 
thread switching overhead)
   3. Move batch coalescing in RepartitionExec _before_ rather than after 
sending (reducing channel traffic / improving cache-friendliness)
   4. Implement morsel-based scan for other datasources
   5. Avoid eagerly executing sub-plans (now that we can extract more 
parallelism). Depends at least on 1. 
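   
   To make item 3 a bit more concrete, here is a minimal sketch of coalescing 
on the sending side of a channel. It is not DataFusion code: `Batch`, 
`Coalescer` and `send_coalesced` are hypothetical names, a plain `Vec<u64>` 
stands in for a record batch, and `std::sync::mpsc` stands in for the 
repartition channels.

```rust
use std::sync::mpsc;

/// Stand-in for a record batch: just a vector of row values (hypothetical type).
type Batch = Vec<u64>;

/// Buffers small batches and hands back one large batch once
/// at least `target_rows` rows have accumulated.
struct Coalescer {
    target_rows: usize,
    buffer: Batch,
}

impl Coalescer {
    fn new(target_rows: usize) -> Self {
        Self { target_rows, buffer: Vec::with_capacity(target_rows) }
    }

    /// Append a small batch; return the coalesced batch once the buffer is full enough.
    fn push(&mut self, batch: Batch) -> Option<Batch> {
        self.buffer.extend(batch);
        if self.buffer.len() >= self.target_rows {
            Some(std::mem::take(&mut self.buffer))
        } else {
            None
        }
    }

    /// Flush whatever is left at the end of the input.
    fn finish(&mut self) -> Option<Batch> {
        if self.buffer.is_empty() {
            None
        } else {
            Some(std::mem::take(&mut self.buffer))
        }
    }
}

/// Coalesce *before* sending, so the channel carries a few large batches
/// instead of many small ones. Returns the number of channel sends.
fn send_coalesced(input: Vec<Batch>, target_rows: usize, tx: &mpsc::Sender<Batch>) -> usize {
    let mut coalescer = Coalescer::new(target_rows);
    let mut sends = 0;
    for batch in input {
        if let Some(big) = coalescer.push(batch) {
            tx.send(big).expect("receiver dropped");
            sends += 1;
        }
    }
    if let Some(rest) = coalescer.finish() {
        tx.send(rest).expect("receiver dropped");
        sends += 1;
    }
    sends
}
```

   With ten 3-row batches and `target_rows = 8`, this sends four messages 
instead of ten. A real version would also need to bound how long rows sit in 
the buffer so that latency-sensitive queries are not held back.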
   
   I also have a feeling the change in execution might move bottlenecks to 
other parts (e.g. memory bandwidth, aggregation state), so some optimizations 
that weren't worth it before might be now, because it is easier to hit some 
limit.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


