If you are reading this as a dataset and your data is not partitioned on disk, then the reader is going to read the entire content of every file, because there is currently no statistics-based file skipping for IPC files.
If you have some kind of filter, and you can partition your data on the same columns you use in that filter, then you should be able to reduce the total amount of I/O by allowing the dataset reader to skip entire files based on the pathname (see the sketches after the quoted thread below).

On Mon, May 9, 2022 at 9:39 PM Antoine Pitrou <anto...@python.org> wrote:
>
>
> On 10/05/2022 at 04:36, Andrew Piskorski wrote:
> > On Mon, May 09, 2022 at 07:00:47PM +0200, Antoine Pitrou wrote:
> >
> >> Generally, the Arrow IPC file/stream formats are designed for large
> >> data. If you have many very small files you might try to rethink how you
> >> store your data on disk.
> >
> > Ah. Is this because of the overhead of mmap itself, or the metadata
> > that must be read separately for each file, (or both)?
>
> Because no particular effort was spent to optimize per-file overhead
> (and, yes, metadata must be read independently for each file). By the
> way, the same thing can be said of Parquet files.
>
> > Would creating my files with write_dataset() instead of write_feather()
> > help?
>
> I don't think that would change anything, assuming you end up with the
> same set of files at the end. What could improve things is *reading* the
> data as a dataset, as the datasets layer is able to parallelize reads to
> cover latencies.
>
> > Btw, I have no problem if Linux decides to pre-fetch my mmap-ed data;
> > that's what mmap is for after all. What I DON'T want, is for Arrow to
> > WAIT for that data to actually be fetched. Or at least I want it to
> > wait as little as possible, as presumably it must read some metadata.
> > Are there ways I should minimize the amount of (possibly redundant)
> > metadata Arrow needs to read?
>
> If possible, I would suggest writing files incrementally using the IPC
> stream format, which could allow you to consolidate the data in a
> smaller number of files. Whether that's possible depends on how the data
> is produced, of course (do these files correspond to distinct
> observations in time?).
>
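To make the partitioning suggestion above concrete, here is a rough sketch (untested; the "year" column and the "data_root" directory are made up for illustration) of writing a Feather/IPC dataset partitioned on one column and then reading it back with a filter on that same column, so the reader can skip whole files from the directory names alone:

    import pyarrow as pa
    import pyarrow.dataset as ds

    # toy table with a column we will partition on
    table = pa.table({
        "year": [2020, 2020, 2021, 2021],
        "value": [1.0, 2.0, 3.0, 4.0],
    })

    # one directory (and file) per distinct "year" value, Feather/IPC format
    ds.write_dataset(
        table,
        "data_root",
        format="feather",
        partitioning=ds.partitioning(
            pa.schema([("year", pa.int64())]), flavor="hive"),
    )

    # the filter on "year" lets the dataset reader skip entire files based
    # on their pathnames, instead of opening every one of them
    dataset = ds.dataset("data_root", format="feather", partitioning="hive")
    result = dataset.to_table(filter=ds.field("year") == 2021)

The dataset reader will also parallelize reads across the files it does open, which is the improvement Antoine mentions in the quoted thread above.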
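And here is a similarly rough sketch of Antoine's suggestion to write incrementally with the IPC stream format; produce_batches() and the file name are placeholders for however your observations actually arrive:

    import pyarrow as pa

    schema = pa.schema([("ts", pa.timestamp("us")),
                        ("value", pa.float64())])

    # append record batches to one stream file as observations arrive,
    # instead of writing a separate small IPC file per observation
    with pa.OSFile("observations.arrows", "wb") as sink:
        with pa.ipc.new_stream(sink, schema) as writer:
            for batch in produce_batches():   # placeholder generator
                writer.write_batch(batch)

    # reading it back (a memory-mapped source works for the stream format too)
    with pa.memory_map("observations.arrows") as source:
        table = pa.ipc.open_stream(source).read_all()

Fewer, larger files means less per-file metadata to read and fewer mmap calls, which should help with the per-file overhead discussed above.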