westonpace commented on pull request #11616:
URL: https://github.com/apache/arrow/pull/11616#issuecomment-1007855423


   > I wonder, depending on selectivity and batch size, how well this stacks up
   > to just pre-caching the entire record batch (header + body), especially on
   > S3, disregarding the column filter. Having to load the record batch metadata
   > interspersed across the entire file is a disadvantage compared to Parquet,
   > which stores all of that in the footer (so we can coalesce reads of data
   > across row groups), whereas here we're only coalescing within each record batch.
   
   @lidavidm 
   
   Fair question.  I've been wondering whether we might someday want to
investigate a variant of the Arrow IPC file format that stores the batch
lengths in the footer.  The schema is already there, so if we also had the
batch lengths (which should only cost 8 bytes * num_batches) then we could:
   
   A. Have O(1) random access to an individual row (rough sketch after this list)
   B. Know from just reading the footer all the ranges that we need to access
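
   To make (A) a little more concrete, here is a rough sketch (not part of this
PR) of how per-batch row counts read from such a footer could be turned into a
batch/row lookup.  `LocateRow` is just a made-up helper name, and `batch_lengths`
is assumed to be the hypothetical footer field; the lookup is a prefix sum plus
a binary search over at most num_batches entries:

   ```cpp
   #include <algorithm>
   #include <cstdint>
   #include <utility>
   #include <vector>

   // Given the row count of each batch (assumed to come from the footer),
   // return {batch index, row offset within that batch} for a global row index.
   std::pair<std::size_t, std::int64_t> LocateRow(
       const std::vector<std::int64_t>& batch_lengths, std::int64_t row_index) {
     // ends[i] = cumulative row count up to and including batch i,
     // i.e. the first global row index *after* batch i.
     std::vector<std::int64_t> ends(batch_lengths.size());
     std::int64_t total = 0;
     for (std::size_t i = 0; i < batch_lengths.size(); ++i) {
       total += batch_lengths[i];
       ends[i] = total;
     }
     // The first batch whose cumulative count exceeds row_index contains the row.
     auto it = std::upper_bound(ends.begin(), ends.end(), row_index);
     std::size_t batch = static_cast<std::size_t>(it - ends.begin());
     std::int64_t row_in_batch = row_index - (batch == 0 ? 0 : ends[batch - 1]);
     return {batch, row_in_batch};
   }
   ```

   For example, with batch lengths {10, 10, 5}, row 12 resolves to batch 1,
row 2.  The same footer information would also give us (B), since the batch
byte offsets plus the lengths tell us every range we need to fetch up front.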
   
   That being said, this seems less important once files get large enough.
Even on S3, the metadata fetches are only a very small fraction of the total
access time.  I don't plan to investigate that as part of this PR.

