Re: No replacement dictionaries supported in pyarrow?

Wes McKinney Fri, 19 Mar 2021 16:54:00 -0700

Part of the rationale for the file format was to enable custom
applications to put indexing structures in the file metadata. I still
think this is useful and it's hard for us to know exactly how people
are using this out in the wild. Without such an index, you must do
a bunch of IPC reconstruction to find your needle in the haystack. And
Arrow data can of course be memory-mapped, so I don't think that
having large datasets is alone an argument to use Parquet, where
needle-in-a-haystack operations are much more expensive.


An example application in the financial world is to sort data by time
and build a time series index to facilitate point-in-time lookups or
other time series analytics.
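
A point-in-time ("as of") lookup over such an index could be sketched as
below. This uses plain Python lists in place of record batches, and the
timestamps, prices, and the as_of() helper are made up for illustration.

```python
import bisect

# Hypothetical batches of (timestamp, price) rows, sorted by timestamp.
batches = [
    [(100, 1.0), (105, 1.1)],
    [(110, 1.2), (120, 1.3)],
]

# Time series index: the first timestamp of each batch.
starts = [batch[0][0] for batch in batches]


def as_of(t):
    """Return the latest row with timestamp <= t, or None if there is none."""
    i = bisect.bisect_right(starts, t) - 1
    if i < 0:
        return None  # t precedes all data
    times = [ts for ts, _ in batches[i]]
    # starts[i] <= t, so at least one row in this batch qualifies.
    pos = bisect.bisect_right(times, t) - 1
    return batches[i][pos]


print(as_of(107))
```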

(Note that Parquet has in the last few years implemented some indexing
data structures to try to account for the lack of random access in its
original design.)

On Fri, Mar 19, 2021 at 12:57 PM Antoine Pitrou <[email protected]> wrote:
>
>
> One more general question is whether the file format is really
> beneficial over the stream format in practice.  I understand the
> theoretical argument for direct access to specific batches, but are
> there situations where it really matters?  Intuitively, it seems to me
> that if your data is really large, you may be better off with a more
> space-optimized format such as Parquet.
>
>
> Le 19/03/2021 à 19:49, Wes McKinney a écrit :
> > Okay, let’s open an issue then to address that at some point. What I recall
> > from our last discussion was that the dictionaries would be “processed”
> > when beginning to read the file, appending all the deltas to yield one set
> > of dictionaries for reassembly. The downside is that the “partial
> > dictionaries” that existed at the time that the file was written are not
> > recoverable, but that seems like an acceptable compromise.
> >
> > On Fri, Mar 19, 2021 at 10:34 AM Antoine Pitrou <[email protected]> wrote:
> >
> >>
> >> Le 19/03/2021 à 13:37, Wes McKinney a écrit :
> >>> I am also under the impression that the file format is supposed to support
> >>> deltas, but not replacements. Is this not implemented in C++?
> >>
> >> Definitely not.  Also I was not aware that the file format was supposed
> >> to support deltas.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >
