westonpace commented on pull request #11588:
URL: https://github.com/apache/arrow/pull/11588#issuecomment-959828500


   That makes sense, and I agree it could hamper performance.  One thing we 
have not tested or benchmarked well yet is multi-query performance.  Most of 
our work so far has gone into making a single query run as fast as possible, 
but once multiple queries run at the same time, issues like this will 
presumably start to surface.
   
   I think we can get away with always applying RANDOM, but I'm not certain 
about memory-mapped files.  Details:
   
   With most of our readers (soon to be all, I hope) we don't really need the 
OS to do prefetching for us.  Even if it were helpful, it isn't something we 
can rely on, because remote filesystems (e.g. S3, GCS) would suffer.  We 
typically know exactly what ranges of data we want to access and have a pretty 
good idea (e.g. from batch readahead) of when we need to start loading the data.
   
   That's why I was really hoping we could do a combination of RANDOM (please 
don't prefetch for me) and WILLNEED (please prefetch this specifically starting 
now).  For regular files, it probably isn't that big of a deal.  We should 
probably always apply RANDOM.  We don't really need prefetching because we are 
doing asynchronous reads on the I/O thread pool and we kick those off early.  
We do our own plugging and batching of requests with ReadRangeCache.  We are 
manually duplicating what the OS is doing (since we can't rely on it for remote 
filesystems), and it works OK as far as I can tell.
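
   For illustration only, here is a rough sketch of the RANDOM + WILLNEED 
combination on a regular file, written in Python rather than against Arrow's 
C++ code.  The helper name `read_range_with_hints` is hypothetical, the hints 
are POSIX-only, and they are purely advisory, so the sketch ignores failures 
from them.

```python
import os


def read_range_with_hints(path, offset, length):
    """Read `length` bytes at `offset`, hinting the OS about our access pattern.

    Hypothetical helper for illustration; not part of Arrow.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        if hasattr(os, "posix_fadvise"):
            try:
                # RANDOM over the whole file (offset=0, len=0 means "to end"):
                # please don't do sequential readahead for me.
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_RANDOM)
                # WILLNEED on the exact range: please start loading this now.
                os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
            except OSError:
                # The hints are advisory; a filesystem may reject them.
                pass
        # The explicit read is what actually returns the data.
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)
```

   In a real reader the WILLNEED hint would be issued ahead of time (e.g. when 
the batch is scheduled) and the read would happen later on an I/O thread; the 
two calls sit next to each other here only to keep the sketch short.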
   
   So I suspect the issue is really just with memory-mapped files.  Because 
ReadAt is a no-op as far as the OS is concerned, the OS won't know that it 
needs to load those pages in.  We do that today with WILLNEED.  My original 
assumption was that prefetching wasn't applied to mmap'd files, but that 
appears to be false.  So is there any way we can disable prefetching but still 
inform the OS that it needs to load a specific range of pages into RAM?
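
   To make the question concrete, the calls I have in mind for the mmap case 
look roughly like the Python sketch below.  The helper name 
`prefetch_mapped_range` is hypothetical, `mmap.madvise` needs Python 3.8+ on a 
platform that has madvise(), and whether the two hints actually compose the way 
we want is an assumption that would need measuring.

```python
import mmap


def prefetch_mapped_range(mm, offset, length):
    """Hint that a mapping is accessed randomly, then request one range.

    Hypothetical helper for illustration; not part of Arrow.
    Returns the page-aligned (start, span) that was passed to WILLNEED.
    """
    page = mmap.PAGESIZE
    # madvise() requires a page-aligned start address, so round down
    # and grow the span to still cover the requested bytes.
    start = (offset // page) * page
    span = (offset - start) + length
    if hasattr(mm, "madvise") and hasattr(mmap, "MADV_RANDOM"):
        try:
            # RANDOM on the whole mapping: no readahead on page faults.
            mm.madvise(mmap.MADV_RANDOM)
            # WILLNEED on just the range we are about to touch.
            mm.madvise(mmap.MADV_WILLNEED, start, span)
        except OSError:
            # Advisory only; ignore platforms that reject the hint.
            pass
    return start, span
```

   The open question is whether the RANDOM hint on the mapping stays in effect 
after the WILLNEED call, or whether we need some other mechanism entirely.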
   



