westonpace commented on pull request #11616: URL: https://github.com/apache/arrow/pull/11616#issuecomment-1007855423
> I wonder, depending on selectivity and batch size, how well this stacks up to just pre-caching the entire record batch (header + body), especially on S3, disregarding the column filter. Having to load the record batch metadata interspersed across the entire file is a disadvantage compared to Parquet, which stores all that in the footer (so we can coalesce reads of data across row groups), whereas here we're only coalescing within each record batch.

@lidavidm Fair question. I've been wondering if we might want to someday investigate a variant of the Arrow IPC file format that stores batch lengths in the footer. The schema itself is there, so if we had the batch lengths (which should only be 8 bytes * num_batches) then we could:

A. Have O(1) random access to an individual row
B. Know, from just reading the footer, all the ranges that we need to access

That being said, this seems less important when files get large enough. Even on S3 the metadata fetch is only a very small fraction of the total access time. I don't plan on investigating that as part of this PR.
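To sketch what (A) and (B) might look like (purely illustrative, not something implemented in this PR): the existing IPC footer already records per-batch byte ranges via its `Block` entries, so the hypothetical addition is the per-batch row count. With that, locating the batch that holds an arbitrary row is a prefix sum plus a binary search, and the byte range to fetch is known without touching the interspersed batch headers. `BatchEntry` and `LocateRow` below are made-up names, not existing Arrow APIs.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <utility>
#include <vector>

// Hypothetical footer metadata: one entry per record batch, giving its row
// count (the proposed addition) and the byte range of its message
// (metadata header + body), which the IPC footer's Block entries already hold.
struct BatchEntry {
  int64_t num_rows;
  int64_t offset;  // absolute file offset of the batch message
  int64_t length;  // total length of the batch message in bytes
};

// Locate the batch containing global row `row_index` using a prefix sum over
// the per-batch row counts. Returns {batch_index, row_within_batch}.
std::pair<int64_t, int64_t> LocateRow(const std::vector<BatchEntry>& batches,
                                      int64_t row_index) {
  std::vector<int64_t> cumulative(batches.size() + 1, 0);
  for (size_t i = 0; i < batches.size(); ++i) {
    cumulative[i + 1] = cumulative[i] + batches[i].num_rows;
  }
  // The first cumulative count strictly greater than row_index marks the batch.
  auto it = std::upper_bound(cumulative.begin(), cumulative.end(), row_index);
  int64_t batch_index = static_cast<int64_t>(it - cumulative.begin()) - 1;
  return {batch_index, row_index - cumulative[batch_index]};
}

int main() {
  // Example: three batches of 1000 rows each (offsets/lengths are made up).
  std::vector<BatchEntry> batches = {
      {1000, 512, 40960}, {1000, 41472, 40960}, {1000, 82432, 40960}};
  auto [batch, row] = LocateRow(batches, 1500);
  // Knowing only the footer, the read to issue is
  // [batches[batch].offset, batches[batch].offset + batches[batch].length).
  std::cout << "row 1500 -> batch " << batch << ", local row " << row << "\n";
  return 0;
}
```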