I am also under the impression that the file format is supposed to support
deltas, but not replacements. Is this not implemented in C++?
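
To make the distinction concrete, here is a toy model in plain Python. It is
not the Arrow API: the message tuples and the function names `replay_stream`
and `random_access` are made up for illustration, and the file reader is
assumed to fold every dictionary message into one dictionary before jumping to
a batch. Under that assumption, deltas are harmless for random access, while a
replacement changes what an earlier batch decodes to.

```python
# Toy model of dictionary messages; not the Arrow API. All names are hypothetical.
# A message is (kind, payload) where kind is "dictionary", "delta" or "batch".


def apply_dictionary(dictionary, kind, payload):
    """Return the dictionary after applying one dictionary-related message."""
    if kind == "dictionary":   # replacement: old values are discarded
        return list(payload)
    if kind == "delta":        # delta: new values are appended
        return dictionary + list(payload)
    return dictionary


def replay_stream(messages):
    """Decode batches in order, as a stream reader would."""
    dictionary, decoded = [], []
    for kind, payload in messages:
        if kind == "batch":
            decoded.append([dictionary[i] for i in payload])
        else:
            dictionary = apply_dictionary(dictionary, kind, payload)
    return decoded


def random_access(messages, batch_number):
    """Decode one batch after folding all dictionary messages together
    (assumed model of a random-access file reader)."""
    dictionary = []
    for kind, payload in messages:
        dictionary = apply_dictionary(dictionary, kind, payload)
    batches = [payload for kind, payload in messages if kind == "batch"]
    return [dictionary[i] for i in batches[batch_number]]


with_delta = [
    ("dictionary", ["a", "b"]), ("batch", [0, 1]),
    ("delta", ["c"]), ("batch", [2, 0]),
]
with_replacement = [
    ("dictionary", ["a", "b"]), ("batch", [0, 1]),
    ("dictionary", ["x", "y"]), ("batch", [1, 0]),
]
```

With `with_delta`, `random_access` agrees with `replay_stream` for every batch.
With `with_replacement`, batch 0 decodes against the later dictionary, so the
two disagree.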

On Thu, Mar 18, 2021 at 9:57 PM Nate Bauernfeind <nate.bauernfe...@gmail.com>
wrote:

> If dictionary replacements were supported, then the IPC file format
> couldn't guarantee random access reads.
>
> Personally, I would like to support a stream-based file format that is a
> series of the Flight protobufs. My extension of Arrow Flight stuffs our
> state-based data into the app_metadata field on the FlightData object, so
> we can't write down a stream natively in the IPC-based file format (for
> testing, or for sharing the reproduction of an error). In particular, the
> IPC format is based around the flatbuffer payloads instead of the Flight
> protobuf payloads. It might be nice to support an additional type of IPC
> file for stateful streams. If there is interest, it would be easy to
> integrate with the existing code using a different magic field in the
> footer (such as 'FLGHT1' instead of 'ARROW1'). In addition to the offsets
> and sizes of payloads, it might be nice to indicate the type of payload
> (RecordBatch vs. DictionaryBatch, etc.). We wouldn't have O(1) random
> access, but I think in the "replay of a stream" scenario, one probably
> isn't looking for random access anyway.
>
> On Thu, Mar 18, 2021 at 9:07 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > Hmm, I noticed this "The IPC file format doesn't support dictionary
> > replacements or deltas." I was under the impression we aimed to support
> > dictionary deltas in the file format.  If not, we should remove "Delta
> > dictionaries are applied in the order they appear in the file footer."
> > from the specification.
> >
> > On Thu, Mar 18, 2021 at 8:48 AM Antoine Pitrou <anto...@python.org>
> > wrote:
> >
> > >
> > > It's a bit more configurable, but basically yes.  See the IPC write
> > > options:
> > >
> > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/options.h#L73
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > On 18/03/2021 at 16:37, Jacob Quinn wrote:
> > > > Ah, interesting. So to make sure I understand correctly, the C++ write
> > > > implementation will scan all "batches" and unify all dictionary values
> > > > before writing out the schema + dictionary messages? But only when
> > > > writing the file format? In the streaming case, it would still write
> > > > replacement/delta dictionary messages as needed.
> > > >
> > > > -Jacob
> > > >
> > > > On Thu, Mar 18, 2021 at 9:10 AM Neal Richardson
> > > > <neal.p.richard...@gmail.com> wrote:
> > > >
> > > >> Somewhat related issue:
> > > >> https://issues.apache.org/jira/browse/ARROW-10406
> > > >>
> > > >> On Wed, Mar 17, 2021 at 11:22 PM Micah Kornfield
> > > >> <emkornfi...@gmail.com> wrote:
> > > >>
> > > >>> BTW, this nuance always felt a little strange to me, but changing it
> > > >>> would have required adding additional information to the file format,
> > > >>> to disambiguate when exactly a dictionary was intended to be replaced.
> > > >>>
> > > >>> On Wed, Mar 17, 2021 at 11:19 PM Micah Kornfield
> > > >>> <emkornfi...@gmail.com> wrote:
> > > >>>
> > > >>>> Hi Jacob,
> > > >>>> There is nuance.  The file format does not support dictionary
> > > >>>> replacement; the specification [1] explains why that is currently
> > > >>>> the case.  Only the "stream format" supports replacement (i.e. no
> > > >>>> magic number, only schema followed by one or more
> > > >>>> dictionary/record-batch messages).
> > > >>>>
> > > >>>> -Micah
> > > >>>>
> > > >>>> [1] https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format
> > > >>>>
> > > >>>> On Wed, Mar 17, 2021 at 11:04 PM Jacob Quinn
> > > >>>> <quinn.jac...@gmail.com> wrote:
> > > >>>>
> > > >>>>> Had an issue come up here:
> > > >>>>> https://github.com/JuliaData/Arrow.jl/issues/129#issuecomment-777350450
> > > >>>>>
> > > >>>>> The implementation status page says C++ supports replacement
> > > >>>>> dictionaries and that Python tracks the C++ implementation. Is this
> > > >>>>> just a pyarrow issue where it specifically doesn't support
> > > >>>>> replacement dictionaries? Or is it not "hooked in" properly?
> > > >>>>>
> > > >>>>> -Jacob
> > > >>>>>
> > > >>>>
> > > >>>
> > > >>
> > > >
> > >
> >
>
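
For the replay-style file Nate describes above, a rough sketch of what such a
layout could look like, using only the Python standard library. Everything
here is an assumption for illustration: the 'FLGHT1' magic is the one floated
in the thread, while the footer layout (offset, size, payload type per entry),
the type tags, and the function names are hypothetical and are not part of any
Arrow specification. Payloads are treated as opaque bytes; in practice they
would presumably be serialized Flight protobuf messages.

```python
import io
import struct

MAGIC = b"FLGHT1"                      # hypothetical magic suggested in the thread
RECORD_BATCH, DICTIONARY_BATCH = 0, 1  # hypothetical payload type tags
ENTRY = struct.Struct("<QQB")          # footer entry: offset, size, payload type


def write_replay_file(payloads):
    """Serialize a list of (type_tag, bytes) payloads and return the file bytes."""
    out = io.BytesIO()
    out.write(MAGIC)
    index = []
    for type_tag, body in payloads:
        index.append((out.tell(), len(body), type_tag))
        out.write(body)
    footer = b"".join(ENTRY.pack(*entry) for entry in index)
    out.write(footer)
    out.write(struct.pack("<I", len(footer)))  # footer length precedes the magic
    out.write(MAGIC)
    return out.getvalue()


def replay(data):
    """Yield (type_tag, bytes) in write order: a sequential replay, not O(1) access."""
    if data[:len(MAGIC)] != MAGIC or data[-len(MAGIC):] != MAGIC:
        raise ValueError("not a replay file")
    length_pos = len(data) - len(MAGIC) - 4
    (footer_len,) = struct.unpack_from("<I", data, length_pos)
    footer_start = length_pos - footer_len
    for pos in range(footer_start, length_pos, ENTRY.size):
        offset, size, type_tag = ENTRY.unpack_from(data, pos)
        yield type_tag, data[offset:offset + size]


example_payloads = [
    (DICTIONARY_BATCH, b"dict-0"),
    (RECORD_BATCH, b"batch-0"),
    (DICTIONARY_BATCH, b"dict-1"),   # a replacement is fine when replaying in order
    (RECORD_BATCH, b"batch-1"),
]
```

Because the reader walks the footer in write order, a dictionary replacement
is applied exactly where it occurred in the original stream, which is the
property the replay scenario needs.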
