Part of the rationale for the file format was to enable custom applications to put indexing structures in the file metadata. I still think this is useful, and it's hard for us to know exactly how people are using this out in the wild. Without such an index, you must do a bunch of IPC reconstruction to find your needle in the haystack. And Arrow data can of course be memory mapped, so I don't think that having large datasets is by itself an argument for using Parquet, where needle-in-a-haystack operations are much more expensive.
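To make the idea concrete, here is a minimal, library-agnostic sketch rather than actual Arrow code. It assumes the rows are sorted by some key across the whole file, that the application stores the first key of every record batch as JSON in the file's custom metadata, and that the reader can fetch batch number i directly. The names `first_keys`, `lookup`, and `read_batch` are hypothetical, and plain Python lists stand in for record batches.

```python
import bisect
import json

# Stand-ins for the record batches of a file: each batch is a list of
# (key, value) rows, and rows are sorted by key across the whole file.
batches = [
    [(1, "a"), (3, "b"), (5, "c")],
    [(8, "d"), (9, "e")],
    [(12, "f"), (15, "g"), (20, "h")],
]

# Hypothetical application-defined index kept in the file metadata:
# the first key of every batch, JSON-encoded.
metadata = {"first_keys": json.dumps([batch[0][0] for batch in batches])}


def lookup(key, metadata, read_batch):
    """Return the value stored under `key`, reading at most one batch."""
    first_keys = json.loads(metadata["first_keys"])
    # Pick the rightmost batch whose first key is <= key.
    i = bisect.bisect_right(first_keys, key) - 1
    if i < 0:
        return None  # key is smaller than everything in the file
    batch = read_batch(i)
    keys = [k for k, _ in batch]
    j = bisect.bisect_left(keys, key)
    if j < len(keys) and keys[j] == key:
        return batch[j][1]
    return None


# Record which batches get read, to show that only one is touched.
reads = []


def read_batch(i):
    reads.append(i)
    return batches[i]


result = lookup(9, metadata, read_batch)
```

With the stream format there is no way to jump to batch i, so the same lookup would have to walk the batches from the start until it reached the right one.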
An example application in the financial world is to sort data by time and build a time series index to facilitate point-in-time lookups or other time series analytics. (Note that Parquet in the last few years implemented some indexing data structures to try to account for the lack of random access in its original design.)

On Fri, Mar 19, 2021 at 12:57 PM Antoine Pitrou <anto...@python.org> wrote:
>
> One more general question is whether the file format is really
> beneficial over the stream format in practice. I understand the
> theoretical argument for direct access to specific batches, but are
> there situations where it really matters? Intuitively, it seems to me
> that if your data is really large, you may be better off with a more
> space-optimized format such as Parquet.
>
>
> On 19/03/2021 at 19:49, Wes McKinney wrote:
> > Okay, let’s open an issue then to address that at some point. What I recall
> > from our last discussion was that the dictionaries would be “processed”
> > when beginning to read the file, appending all the deltas to yield one set
> > of dictionaries for reassembly. The downside is that the “partial
> > dictionaries” that existed at the time that the file was written are not
> > recoverable, but that seems like an acceptable compromise.
> >
> > On Fri, Mar 19, 2021 at 10:34 AM Antoine Pitrou <anto...@python.org> wrote:
> >
> >>
> >> On 19/03/2021 at 13:37, Wes McKinney wrote:
> >>> I am also under the impression that the file format is supposed to support
> >>> deltas, but not replacements. Is this not implemented in C++?
> >>
> >> Definitely not. Also I was not aware that the file format was supposed
> >> to support deltas.
> >>
> >> Regards
> >>
> >> Antoine.
> >>