Dandandan commented on PR #20481: URL: https://github.com/apache/datafusion/pull/20481#issuecomment-3980342529
> We currently load the page index / bloom filter info for all row groups in one IO right?

Mostly, I think. The current code loads and caches the page index for all row groups in the metadata (when reading one `PartitionedFile` with a file range). When two partitions read the same file range, I think the page index is only cached _after_ it is loaded, so there might be some redundant loads.

> I imagine the key is to make IO operations large enough: if the page index metadata for a single row group is 2kB that's a waste of IO. If it's 4MB doing 8 row groups x 4MB at once ~= individual 4MB requests (the latter may even be faster). But that depends on the storage...

Yes - I guess on SSD/fast storage it could be worth loading it more lazily (this would both add a bit more parallelism when there are fewer files and avoid having to wait on the final "morselization" tasks, which slows things down), but the benefit may be hard to see.
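
To put rough numbers on the IO-size point quoted above, here is a minimal sketch of a toy cost model (time = fixed request latency + bytes / bandwidth). The 30 ms latency, the 100 MB/s bandwidth, and all function names are assumptions made up for illustration; none of this is DataFusion code or a measurement.

```rust
// Toy cost model: time = fixed latency + bytes / bandwidth.
// Every number here is an illustrative assumption, not a measurement.

/// Estimated seconds for a single request of `bytes`.
fn request_secs(bytes: u64, latency_secs: f64, bytes_per_sec: f64) -> f64 {
    latency_secs + bytes as f64 / bytes_per_sec
}

/// One combined request that covers the page index of every row group.
fn batched_secs(row_groups: u64, bytes_each: u64, latency_secs: f64, bytes_per_sec: f64) -> f64 {
    request_secs(row_groups * bytes_each, latency_secs, bytes_per_sec)
}

/// One request per row group, issued back to back with no overlap.
fn sequential_secs(row_groups: u64, bytes_each: u64, latency_secs: f64, bytes_per_sec: f64) -> f64 {
    row_groups as f64 * request_secs(bytes_each, latency_secs, bytes_per_sec)
}

/// Fraction of a single request's time that is pure latency.
fn latency_share(bytes: u64, latency_secs: f64, bytes_per_sec: f64) -> f64 {
    latency_secs / request_secs(bytes, latency_secs, bytes_per_sec)
}

fn main() {
    let latency_secs = 0.03; // hypothetical 30 ms per request
    let bytes_per_sec = 100e6; // hypothetical 100 MB/s
    let row_groups = 8;

    // Compare a tiny (2 KiB) and a large (4 MiB) page index per row group.
    for bytes_each in [2 * 1024u64, 4 * 1024 * 1024] {
        let batched = batched_secs(row_groups, bytes_each, latency_secs, bytes_per_sec);
        let sequential = sequential_secs(row_groups, bytes_each, latency_secs, bytes_per_sec);
        println!(
            "{} bytes/row group: latency share {:.1}%, sequential/batched = {:.2}x",
            bytes_each,
            100.0 * latency_share(bytes_each, latency_secs, bytes_per_sec),
            sequential / batched
        );
    }
}
```

Under these assumed numbers a 2 KiB request is almost entirely latency, so splitting it per row group costs close to 8x, while at 4 MiB the penalty for individual requests is much smaller. The model issues the individual requests one after another; if they run concurrently, which the model does not capture, they could match or beat the single large request, and that depends on the storage.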
