Hi Andrew,
Unfortunately, mmap is designed to implement “transparent paging”, meaning that 
the OS decides when pages of the file are read from disk. This means that Arrow 
has no way of controlling when the file is actually read, and it’s possible 
that the OS is prefetching the whole file, given how small the files are. 
That said, I’ve seen before that just doing thousands of mmaps can add 
significant overhead, as mmap is a fairly expensive system call. 

As for solutions, is there some reason you need mmap? Could you perhaps open an 
InputStream for each file (equivalent to just opening the file) and then call 
read_feather later, when you actually need the data?
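
To illustrate the general idea of deferring the read, here is a rough sketch 
in plain Python rather than R. It does not use Arrow at all: LazyFile and demo 
are hypothetical names, and an ordinary file read stands in for the eventual 
read_feather call. It goes one step further than holding an open stream and 
only remembers the path, so nothing is opened or read until first access.

```python
import os
import tempfile


class LazyFile:
    """Remember a path now; read the bytes only on first access.

    Hypothetical helper for illustration only, not part of Arrow.
    """

    def __init__(self, path):
        self.path = path
        self._data = None  # nothing has been read yet

    @property
    def loaded(self):
        return self._data is not None

    def read(self):
        # The actual I/O happens here, once, and the result is cached.
        if self._data is None:
            with open(self.path, "rb") as f:
                self._data = f.read()
        return self._data


def demo():
    # Create a few small files as stand-ins for the many small Arrow files.
    tmpdir = tempfile.mkdtemp()
    paths = []
    for i in range(3):
        p = os.path.join(tmpdir, f"part{i}.bin")
        with open(p, "wb") as f:
            f.write(bytes([i]) * 4)
        paths.append(p)

    # Cheap setup step: no file is opened or read here.
    handles = [LazyFile(p) for p in paths]
    loaded_before = [h.loaded for h in handles]

    # Later, read only the file that is actually needed.
    payload = handles[1].read()
    loaded_after = [h.loaded for h in handles]
    return loaded_before, payload, loaded_after
```

The setup loop then costs only the bookkeeping per file, and the read cost is 
paid just for the files you end up touching.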

Sasha Krassovsky 

> On May 9, 2022, at 09:38, Andrew Piskorski <a...@piskorski.com> wrote:
> 
> Hello, I'm using R package arrow_7.0.0.tar.gz, in R 4.1.1, on Linux
> (Ubuntu 18.04.4 LTS).
> 
> In R, I am mmap-ing many small Arrow files by calling arrow::read_feather()
> with as_data_frame=FALSE on each one.  Compressed with lz4, each file
> is quite small, often only 25 kB or so, but I'll often be mmap-ing
> many thousands of them.  From the time this takes, I suspect that
> Arrow is reading the full contents of each file rather than just
> setting up the mmap, but I don't know how to properly check that.
> 
> I would like to make sure that at this stage, I JUST mmap each file,
> and defer reading their data until later when I actually need it.  Are
> there any settings or arguments I can use to make sure that happens?
> Or ways to verify precisely what is happening?
> 
> I think I found the relevant C++ code in "r/src/io.cpp" and
> "cpp/src/arrow/io/file.cc", but I definitely don't understand its
> performance implications, nor how to control this sort of thing.
> 
> Thanks for your help and advice!
> 
> -- 
> Andrew Piskorski <a...@piskorski.com>
