On 10/05/2022 at 04:36, Andrew Piskorski wrote:
> On Mon, May 09, 2022 at 07:00:47PM +0200, Antoine Pitrou wrote:
>> Generally, the Arrow IPC file/stream formats are designed for large
>> data. If you have many very small files, you might try to rethink
>> how you store your data on disk.
> Ah. Is this because of the overhead of mmap itself, or the metadata
> that must be read separately for each file (or both)?
Because no particular effort was spent to optimize per-file overhead
(and, yes, metadata must be read independently for each file). By the
way, the same thing can be said of Parquet files.
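To illustrate with a PyArrow sketch (the file name here is made up):
even when the source is memory-mapped, constructing the reader parses
that file's footer and schema eagerly, so this fixed cost is paid once
per file; only the record batch data itself is mapped lazily.

    import pyarrow as pa

    # Opening the file reads and parses its footer metadata up front,
    # even though the source is memory-mapped.
    with pa.memory_map("data-0001.arrow", "r") as source:
        reader = pa.ipc.open_file(source)
        # read_all() maps the record batches zero-copy; actual page-in
        # is deferred to the OS until the data is touched.
        table = reader.read_all()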
> Would creating my files with write_dataset() instead of
> write_feather() help?
I don't think that would change anything, assuming you end up with the
same set of files. What could improve things is *reading* the data as
a dataset, as the datasets layer is able to parallelize reads to hide
latencies.
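For example, something along these lines in PyArrow (the directory
path is made up):

    import pyarrow.dataset as ds

    # Open the whole directory of Feather/IPC files as one dataset;
    # to_table() scans the files on multiple threads, so per-file
    # metadata reads and page-ins overlap instead of serializing.
    dataset = ds.dataset("path/to/feather/files", format="feather")
    table = dataset.to_table()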
> Btw, I have no problem if Linux decides to pre-fetch my mmap-ed data;
> that's what mmap is for, after all. What I DON'T want is for Arrow to
> WAIT for that data to actually be fetched. Or at least I want it to
> wait as little as possible, as presumably it must read some metadata.
> Are there ways I should minimize the amount of (possibly redundant)
> metadata Arrow needs to read?
If possible, I would suggest writing files incrementally using the IPC
stream format, which could allow you to consolidate the data in a
smaller number of files. Whether that's possible depends on how the data
is produced, of course (do these files correspond to distinct
observations in time?).
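For instance, a minimal PyArrow sketch (the schema, file name, and
producer are stand-ins for your real data source):

    import pyarrow as pa

    schema = pa.schema([("ts", pa.timestamp("us")),
                        ("value", pa.float64())])

    def produce_batches():
        # Stand-in for the real producer: yield small record batches.
        for i in range(3):
            yield pa.record_batch(
                [pa.array([i * 1_000_000], type=pa.timestamp("us")),
                 pa.array([float(i)])],
                schema=schema,
            )

    # Append each batch to one consolidated stream file as it arrives,
    # instead of writing a separate small Feather file per observation.
    with pa.OSFile("consolidated.arrows", "wb") as sink:
        with pa.ipc.new_stream(sink, schema) as writer:
            for batch in produce_batches():
                writer.write_batch(batch)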