Re: [PR] Dynamic work scheduling in FileStream [datafusion]

via GitHub Wed, 15 Apr 2026 01:53:44 -0700


Dandandan commented on PR #21351:
URL: https://github.com/apache/datafusion/pull/21351#issuecomment-4250603405


   > Awesome, so the PR changes who reads which file at runtime using 
morselizer, would be extremely interesting to try this on many small files 
environments.
   > 
   > Do we expect improvements for even partitions(partition have the similar 
number of files with similar sizes)?
   > 
   > Is it planned to morselize deeper to process row groups in parallel?
   > 
   > This activity actually reminds me of #19815 benchmark.
   
   > Do we expect improvements for even partitions(partition have the similar 
number of files with similar sizes)
   
   In my experience, there is always some partition skew, even for very 
balanced scans on a local FS.
   So such scans will still benefit from a morsel-based scan (though only by 
something like 5-10%), as long as there are enough morsels to spread the work 
(at least more than the number of cores).
   Object store scans will almost always benefit (due to the huge variation in 
response times).
   OTOH, the relative improvement will be smaller for larger / more evenly 
balanced datasets.
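   
   To make the skew argument concrete, here is a minimal sketch (a toy 
simulation, not the PR's implementation) comparing a static per-partition 
assignment with workers pulling morsels from a shared queue. The function 
names and the per-file costs are hypothetical and chosen only for illustration.

```python
import heapq


def static_makespan(partitions):
    """Each worker scans only its pre-assigned files; the slowest worker sets the total."""
    return max(sum(costs) for costs in partitions)


def dynamic_makespan(costs, num_workers):
    """Whichever worker becomes free first takes the next morsel from a shared queue."""
    finish_times = [0] * num_workers  # min-heap of per-worker finish times
    heapq.heapify(finish_times)
    for cost in costs:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)


# Hypothetical per-file scan costs (arbitrary time units), one list per partition.
partitions = [[4, 4, 4], [1, 1, 1], [2, 2, 2], [1, 2, 3]]
morsels = [cost for part in partitions for cost in part]

static_total = static_makespan(partitions)        # 12: the skewed partition dominates
dynamic_total = dynamic_makespan(morsels, 4)      # 8: idle workers pick up remaining files
# With only as many morsels as workers, there is nothing left to rebalance.
coarse_total = dynamic_makespan([12, 3, 6, 6], 4)  # 12
```

   The last case is why the number of morsels needs to exceed the number of 
cores: with one morsel per worker, dynamic scheduling degenerates to the 
static assignment.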
   
   > Is it planned to morselize deeper to process row groups in parallel?
   
   Yes - the plan is to split morsels into sub-row-group morsels, so smaller 
datasets (e.g. TPC-DS at SF=1, which has single-row-group files) and machines 
with many cores (which currently get too little parallelism) will benefit more 
as well. 
   
   Currently, parallelism is limited for datasets with few row groups, as we 
can't split work below the row-group level.
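   
   As a rough illustration of what sub-row-group morsels could look like, the 
sketch below splits each row group into fixed-size row ranges. The function 
name, the morsel shape, and the target size are all assumptions made for the 
example; a real split would also have to respect Parquet page boundaries and 
decoding costs, which this ignores.

```python
def split_into_morsels(row_group_rows, target_rows):
    """Split row groups into (row_group_index, start_row, end_row) ranges.

    Hypothetical morsel shape: end_row is exclusive and the last range of a
    row group may be shorter than target_rows.
    """
    if target_rows <= 0:
        raise ValueError("target_rows must be positive")
    morsels = []
    for rg_index, num_rows in enumerate(row_group_rows):
        for start in range(0, num_rows, target_rows):
            end = min(start + target_rows, num_rows)
            morsels.append((rg_index, start, end))
    return morsels


# A file with a single row group of 100,000 rows (sizes are made up).
single_row_group_file = [100_000]

# At row-group granularity this file is one unit of work; with a
# 32,768-row target it becomes four independently schedulable ranges.
sub_morsels = split_into_morsels(single_row_group_file, 32_768)
```

   With one row group per file, the number of schedulable units equals the 
number of files, so cores beyond that count stay idle; splitting below the 
row group raises that ceiling.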


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
