Re: [PR] Introduce morsel-driven Parquet scan [datafusion]

via GitHub Sun, 01 Mar 2026 07:57:15 -0800


Dandandan commented on PR #20481:
URL: https://github.com/apache/datafusion/pull/20481#issuecomment-3980342529


   > We currently load the page index / bloom filter info for all row groups in 
one IO right?
   
   Mostly, I think. The current code loads and caches the page index for all 
row groups in the metadata (when reading one `PartitionedFile` with a file 
range). When two partitions process the same file range, I think it is only 
cached _after_ it is loaded, so it might do some redundant loads.
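
To illustrate the redundant-load concern, here is a minimal sketch of a "single-flight" cache in which concurrent requests for the same file share one load instead of each loading before the cache is populated. This is not DataFusion's actual cache: `PageIndexCache`, `PageIndex`, `get_or_load`, and `simulate_partitions` are hypothetical names, and threads stand in for partitions.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex, OnceLock};
use std::thread;

/// Hypothetical stand-in for a parsed page index; not a DataFusion type.
type PageIndex = Vec<u8>;

/// Hypothetical cache keyed by file path. Each entry is a `OnceLock`, so the
/// first caller runs the load and concurrent callers wait for its result.
#[derive(Default)]
struct PageIndexCache {
    entries: Mutex<HashMap<String, Arc<OnceLock<Arc<PageIndex>>>>>,
}

impl PageIndexCache {
    fn get_or_load(&self, path: &str, load: impl FnOnce() -> PageIndex) -> Arc<PageIndex> {
        // Hold the map lock only long enough to find or create the slot.
        let slot = {
            let mut entries = self.entries.lock().unwrap();
            entries.entry(path.to_string()).or_default().clone()
        };
        // `get_or_init` runs the closure at most once per slot.
        Arc::clone(slot.get_or_init(|| Arc::new(load())))
    }
}

/// Simulates `n` partitions requesting the same file concurrently and
/// returns how many loads were actually performed.
fn simulate_partitions(n: usize) -> usize {
    let cache = Arc::new(PageIndexCache::default());
    let loads = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..n)
        .map(|_| {
            let cache = Arc::clone(&cache);
            let loads = Arc::clone(&loads);
            thread::spawn(move || {
                cache.get_or_load("data.parquet", || {
                    loads.fetch_add(1, Ordering::SeqCst);
                    vec![0u8; 16] // placeholder bytes instead of a real page index
                })
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    loads.load(Ordering::SeqCst)
}

fn main() {
    println!("loads performed: {}", simulate_partitions(4));
}
```

In async code the same idea would presumably use an async once-cell (for example `tokio::sync::OnceCell`) rather than a blocking `OnceLock`; whether that fits the scan code here is an open question.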
   
   > I imagine the key is to make IO operations large enough: if the page index 
metadata for a single row group is 2kB that's a waste of IO. If it's 4MB doing 
8 row groups x 4MB at once ~= individual 4MB requests (the latter may even be 
faster). But that depends on the storage...
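
The "make IO operations large enough" idea can be sketched as range coalescing: neighbouring byte ranges are merged into one request when the gap between them is small. The function name `coalesce`, the `max_gap` parameter, and the sizes below are illustrative assumptions, not the PR's code.

```rust
use std::ops::Range;

/// Merge byte ranges whose gap is at most `max_gap` bytes, so that many small
/// neighbouring reads become fewer, larger requests.
fn coalesce(mut ranges: Vec<Range<u64>>, max_gap: u64) -> Vec<Range<u64>> {
    ranges.sort_by_key(|r| r.start);
    let mut merged: Vec<Range<u64>> = Vec::new();
    for r in ranges {
        if let Some(last) = merged.last_mut() {
            if r.start <= last.end.saturating_add(max_gap) {
                // Close enough to the previous range: extend it.
                last.end = last.end.max(r.end);
                continue;
            }
        }
        merged.push(r);
    }
    merged
}

fn main() {
    // Eight hypothetical 2 kB page-index blobs stored back to back.
    let small: Vec<Range<u64>> = (0..8u64)
        .map(|i| (i * 2048)..((i + 1) * 2048))
        .collect();
    println!("{} request(s)", coalesce(small, 0).len());
}
```

With a small `max_gap`, tightly packed 2 kB blobs collapse into a single request, while large blobs separated by other data stay as individual requests. Which threshold is right depends on the storage, as noted above.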
   
   Yes - I guess on SSD/fast storage it could be worth doing it more lazily 
(both to add a bit more parallelism when there are fewer files and to avoid 
being slowed down by waiting on the final "morselization" tasks), but maybe 
hard to see 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


