westonpace commented on pull request #11588: URL: https://github.com/apache/arrow/pull/11588#issuecomment-959828500
That makes sense and I agree it could hamper performance. One thing we do not have very well tested and benchmarked yet is multi-query performance. Most of our work is on making a single query run as fast as possible, but once there are multiple queries running at the same time, issues like this will presumably start to surface.

I think we can get away with always applying RANDOM, but I'm not certain for memory-mapped files.

Details:

With most of our readers (soon to be all, I hope) we don't really need the OS to do prefetching for us. Even if it were helpful, it isn't something we can rely on because remote filesystems (e.g. S3, GCS) will suffer. We typically know exactly what ranges of data we want to access and have a pretty good idea (e.g. batch readahead) of when we need to start loading the data. That's why I was really hoping we could do a combination of RANDOM (please don't prefetch for me) and WILLNEED (please prefetch this specifically, starting now).

For regular files, it probably isn't that big of a deal. We should probably always apply RANDOM. We don't really need prefetching because we are doing asynchronous reads on the I/O thread pool and we kick those off early. We do our own plugging and batching of requests with ReadRangeCache. We are manually duplicating what the OS is doing (since we can't rely on it for remote filesystems) and it works OK as far as I can tell.

So I suspect the issue is really just with memory-mapped files. Because ReadAt is a no-op as far as the OS is concerned, it won't know that it needs to load those pages in. We do that today with WILLNEED. My original assumption would have been that prefetching wasn't applied to mmap'd files, but it appears that is false.

So is there any way we can disable prefetching but still inform the OS that it needs to load a specific range of pages into RAM?
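For concreteness, the RANDOM plus WILLNEED combination described above could look roughly like the following on a POSIX system. This is only a sketch under assumptions: `PageRange`, `AlignToPages`, `DisableReadahead`, and `PrefetchRange` are hypothetical names, not Arrow APIs, and `base` is assumed to be the start address returned by `mmap`. Whether the kernel actually honors WILLNEED on a mapping already marked RANDOM is exactly the open question here, so the sketch shows the calls, not a confirmed answer.

```cpp
#include <sys/mman.h>
#include <unistd.h>

#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper type (not an Arrow API): a page-aligned byte range.
struct PageRange {
  std::size_t offset;
  std::size_t length;
};

// Round [offset, offset + length) outward to page boundaries, because
// madvise requires a page-aligned start address.
inline PageRange AlignToPages(std::size_t offset, std::size_t length,
                              std::size_t page_size) {
  assert(page_size > 0);
  const std::size_t start = offset - (offset % page_size);
  const std::size_t end = offset + length;
  const std::size_t aligned_end =
      ((end + page_size - 1) / page_size) * page_size;
  return {start, aligned_end - start};
}

// "Please don't prefetch for me": advise random access on the whole mapping.
// `base` is assumed to be the address returned by mmap.
inline bool DisableReadahead(void* base, std::size_t mapping_length) {
  return madvise(base, mapping_length, MADV_RANDOM) == 0;
}

// "Please prefetch this specifically, starting now": advise that one byte
// range of the mapping will be needed soon.
inline bool PrefetchRange(void* base, std::size_t offset, std::size_t length) {
  const std::size_t page_size =
      static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
  const PageRange r = AlignToPages(offset, length, page_size);
  return madvise(static_cast<std::uint8_t*>(base) + r.offset, r.length,
                 MADV_WILLNEED) == 0;
}
```

The intended usage would be to call `DisableReadahead` once after mapping the file and `PrefetchRange` for each range we already know we are about to read. Both calls are advisory, so a `true` return only means the kernel accepted the hint.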
