niyue edited a comment on pull request #11588: URL: https://github.com/apache/arrow/pull/11588#issuecomment-961477174
@pitrou

> you actually access only a tiny bit of the record batches' data you just asked to read

This is where I see things differently. To keep things simple, assume the record batch contains only one array, so we don't have to discuss whether multiple arrays are read. The user program does access only a tiny bit of the record batch (one element of the array). Since the file is mmapped, I expect only one page to be loaded, because that is the minimum IO the OS requires (or is this expectation incorrect?). However, I found that simply calling `array[i]` to read one value causes the OS to load multiple pages, which is unexpected.

> because they are only doing very sparse reads and ignoring most of the remaining data

What I see differently is that the user program doesn't intend to ignore most of the remaining data. According to the API it calls, it tries to read one page (actually one array element) and uses the content of that page. As I see it, this doesn't ignore any remaining data: it does ignore the remaining `4KB - one array element size` of data in the page, but that is expected and not what I am complaining about. What I am complaining about is the prefetched pages: the prefetching happens internally and automatically, and I didn't expect it (`why reading mmaped array[i] will lead to multiple pages of IO?` was my question when troubleshooting this problem). I would like to find a way to prevent it in this case.
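
To make the access pattern concrete, here is a minimal sketch outside of Arrow, using Python's standard `mmap` module rather than any Arrow API. It maps a file, advises the kernel that access will be random, and reads a single byte. The helper names `read_one_byte` and `make_demo_file` are hypothetical, and the sketch assumes Python 3.8+ on a platform that exposes `MADV_RANDOM`. Whether the hint really limits the IO to one page depends on the OS and kernel settings, so this shows the idea and is not a verified fix.

```python
import mmap
import os
import tempfile


def read_one_byte(path, offset):
    """Map `path` read-only, hint random access, and read one byte."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # madvise() and MADV_RANDOM exist only on some platforms
            # (Python 3.8+), so guard the call. The hint asks the kernel
            # not to read ahead; it is advisory, not a guarantee.
            if hasattr(mm, "madvise") and hasattr(mmap, "MADV_RANDOM"):
                mm.madvise(mmap.MADV_RANDOM)
            # Indexing a mmap object returns the byte as an int and
            # faults in (at least) the page containing `offset`.
            return mm[offset]


def make_demo_file(num_pages=64):
    """Hypothetical helper: write a temp file whose page k is filled with byte k % 256."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for page in range(num_pages):
            f.write(bytes([page % 256]) * mmap.PAGESIZE)
    return path


if __name__ == "__main__":
    demo_path = make_demo_file()
    try:
        # Read one byte from the middle of page 10.
        print(read_one_byte(demo_path, 10 * mmap.PAGESIZE + 123))
    finally:
        os.remove(demo_path)
```

The number of pages actually brought in by a read like this could be checked with an external tool such as a page-cache inspector; the sketch itself does not measure it.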