Re: [PR] perf: add prefetching for aggregate multi group by [WIP] [datafusion]

via GitHub Tue, 30 Dec 2025 03:50:31 -0800


alamb commented on PR #19520:
URL: https://github.com/apache/datafusion/pull/19520#issuecomment-3699119234


   In theory, there should be as many partitions as cores in most plans, and 
processing each partition should keep a single core busy. Therefore, 
pre-fetching should not be necessary for CPU intensive tasks like grouping.
   
   However, I have also seen the CPU stall during benchmarks with parquet (i.e. 
it doesn't keep all CPUs busy). I think at least part of this stalling is due 
to IO: once the CPU has finished decoding a row group, it stalls waiting for 
the pages of the next row group to be read from disk.
   
   I found that pre-fetching IO seemed to be pretty effective (i.e. starting 
to fetch the next row group before the current one has been completely read). I 
hacked up a prototype here that looks promising:
   - https://github.com/apache/datafusion/pull/18391
   
   If you agree, I can polish that one up some more. 
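
   To make the idea concrete, here is a minimal sketch of overlapping IO with decoding. It is not the code from the prototype PR and does not use any DataFusion or parquet API: `read_row_group`, `decode`, and `scan_with_prefetch` are hypothetical stand-ins, and the sleep only simulates IO latency. The point is just that the read of row group `i + 1` is already in flight while row group `i` is being decoded.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def read_row_group(i):
    """Hypothetical stand-in for the IO that fetches one row group's pages."""
    time.sleep(0.01)  # simulated disk / object store latency (assumption)
    return list(range(i * 4, i * 4 + 4))


def decode(pages):
    """Hypothetical stand-in for the CPU-bound decoding of a row group."""
    return sum(pages)


def scan_with_prefetch(num_row_groups):
    """Decode row groups in order, keeping one read ahead of the decoder."""
    results = []
    if num_row_groups == 0:
        return results
    # A single background worker is enough to keep one read in flight.
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(read_row_group, 0)
        for i in range(num_row_groups):
            pages = pending.result()  # blocks only if the read is not done yet
            if i + 1 < num_row_groups:
                # Start fetching the next row group before decoding this one.
                pending = io.submit(read_row_group, i + 1)
            results.append(decode(pages))
    return results
```

   Keeping exactly one read ahead bounds the extra memory to a single row group's pages; a deeper prefetch queue would trade more memory for more tolerance of IO latency spikes.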


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

