niyue edited a comment on pull request #11588:
URL: https://github.com/apache/arrow/pull/11588#issuecomment-961477174


   @pitrou 
   > you actually access only a tiny bit of the record batches' data you just 
asked to read
   
   This is where I see it differently. To keep things simple, assume the record 
batch contains only one array, so we don't have to discuss whether multiple 
arrays are read. The user program accesses only a tiny bit of the record batch 
(one element of the array). Since the file is mmapped, I expected only 1 page 
to be loaded, as that is the minimum IO the OS can perform (or is this an 
incorrect expectation?). However, I found that simply calling `array[i]` to 
read one value causes the OS to load multiple pages, which is unexpected. 
   
   > because they are only doing very sparse reads and ignoring most of the 
remaining data
   
   Where I see it differently: the user program doesn't intend to ignore most 
of the remaining data. Through the API it calls, it tries to read 1 page 
(really 1 array element) and uses the content of that page. As I see it, this 
doesn't ignore any remaining data. It does ignore the remaining `4KB - one 
array element size` bytes of that page, but this is expected and not what I am 
complaining about here. What I am complaining about is the prefetched pages: 
the prefetching happens internally and automatically, and I didn't expect it 
(`why does reading mmapped array[i] lead to multiple pages of IO?` was my 
question when troubleshooting this problem). I would like to find a way to 
prevent it in this case.
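   
   For illustration, one way to ask the OS not to prefetch around a mapped read 
is `madvise` with `MADV_RANDOM`. The sketch below is a minimal Python example, 
not Arrow code: `make_demo_file` and `read_one_byte` are hypothetical helpers 
made up for this comment, and the advice is only a hint that the kernel may 
honor differently per platform. It does not measure how many pages are 
actually loaded.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def make_demo_file(n_pages):
    """Hypothetical helper: write n_pages pages, each filled with its page index."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for page in range(n_pages):
            f.write(bytes([page % 256]) * PAGE)
    return path


def read_one_byte(path, offset, advise_random=True):
    """Hypothetical helper: map `path` read-only and return the byte at `offset`."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # MADV_RANDOM tells the kernel to expect random access, which is
            # a request to reduce readahead. It is only a hint, and it is not
            # available on every platform, hence the hasattr checks.
            if (
                advise_random
                and hasattr(mm, "madvise")
                and hasattr(mmap, "MADV_RANDOM")
            ):
                mm.madvise(mmap.MADV_RANDOM)
            # Touching a single byte faults in the page that contains it.
            return mm[offset]


if __name__ == "__main__":
    path = make_demo_file(8)
    try:
        print(read_one_byte(path, 5 * PAGE + 3))
    finally:
        os.remove(path)
```

   Whether this actually limits the IO to a single page would still need to be 
checked with the same tooling used to observe the extra page loads.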


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

