On 10/05/2022 at 04:36, Andrew Piskorski wrote:
> On Mon, May 09, 2022 at 07:00:47PM +0200, Antoine Pitrou wrote:
>> Generally, the Arrow IPC file/stream formats are designed for large
>> data. If you have many very small files, you might try to rethink
>> how you store your data on disk.
> Ah. Is this because of the overhead of mmap itself, or the metadata
> that must be read separately for each file (or both)?
Because no particular effort was spent to optimize per-file overhead
(and, yes, metadata must be read independently for each file). By the
way, the same thing can be said of Parquet files.
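To illustrate with a PyArrow sketch (the file name here is made up):
even when the source is memory-mapped, constructing the reader parses
that file's footer and schema eagerly, so this fixed cost is paid once
per file; only the record batch data itself is mapped lazily.

    import pyarrow as pa

    # Opening the file reads and parses its footer metadata up front,
    # even though the source is memory-mapped.
    with pa.memory_map("data-0001.arrow", "r") as source:
        reader = pa.ipc.open_file(source)
        # read_all() maps the record batches zero-copy; actual page-in
        # is deferred to the OS until the data is touched.
        table = reader.read_all()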
> Would creating my files with write_dataset() instead of
> write_feather() help?
I don't think that would change anything, assuming you end up with the
same set of files. What could improve things is *reading* the data as
a dataset, as the datasets layer is able to parallelize reads to hide
latencies.
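For example, something along these lines in PyArrow (the directory
path is made up):

    import pyarrow.dataset as ds

    # Open the whole directory of Feather/IPC files as one dataset;
    # to_table() scans the files on multiple threads, so per-file
    # metadata reads and page-ins overlap instead of serializing.
    dataset = ds.dataset("path/to/feather/files", format="feather")
    table = dataset.to_table()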
> Btw, I have no problem if Linux decides to pre-fetch my mmap-ed data;
> that's what mmap is for, after all. What I DON'T want is for Arrow to
> WAIT for that data to actually be fetched. Or at least I want it to
> wait as little as possible, as presumably it must read some metadata.
> Are there ways I should minimize the amount of (possibly redundant)
> metadata Arrow needs to read?
If possible, I would suggest writing files incrementally using the IPC
stream format, which could allow you to consolidate the data in a
smaller number of files. Whether that's possible depends on how the data
is produced, of course (do these files correspond to distinct
observations in time?).
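For instance, a minimal PyArrow sketch (the schema, file name, and
producer are stand-ins for your real data source):

    import pyarrow as pa

    schema = pa.schema([("ts", pa.timestamp("us")),
                        ("value", pa.float64())])

    def produce_batches():
        # Stand-in for the real producer: yield small record batches.
        for i in range(3):
            yield pa.record_batch(
                [pa.array([i * 1_000_000], type=pa.timestamp("us")),
                 pa.array([float(i)])],
                schema=schema,
            )

    # Append each batch to one consolidated stream file as it arrives,
    # instead of writing a separate small Feather file per observation.
    with pa.OSFile("consolidated.arrows", "wb") as sink:
        with pa.ipc.new_stream(sink, schema) as writer:
            for batch in produce_batches():
                writer.write_batch(batch)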