Re: [Format] Semantics for dictionary batches in streams

Micah Kornfield Tue, 27 Aug 2019 16:05:37 -0700

I was thinking the file format must satisfy one of two conditions:
1.  Exactly one DictionaryBatch per encoded column
2.  DictionaryBatches are interleaved correctly.
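
For reference, here is a rough sketch of the two scenarios Wes describes
below, in plain Python. The helper names (classify_dictionary_change,
unify_dictionary) are hypothetical and are not Arrow APIs; dictionaries are
modeled as plain lists of values and indices as lists of non-null ints.

```python
def classify_dictionary_change(prior, new):
    """Classify how a batch's dictionary relates to the one already sent.

    Hypothetical helper, not an Arrow API. Returns (kind, values).
    """
    prior, new = list(prior), list(new)
    if new == prior:
        return ("unchanged", [])
    if new[:len(prior)] == prior:
        # The prior dictionary is a prefix of the new one: only the
        # tail needs to be sent, as a delta batch.
        return ("delta", new[len(prior):])
    if prior[:len(new)] == new:
        # The new dictionary is a prefix of the prior one: existing
        # indices already resolve, nothing to send.
        return ("prefix", [])
    # Permutation and/or new values: replace, or unify (see below).
    return ("replace", new)


def unify_dictionary(prior, new, indices):
    """Alternative to replacement: remap indices onto prior + delta.

    Assumes indices contains no nulls. Returns (delta, remapped_indices).
    """
    position = {value: i for i, value in enumerate(prior)}
    delta = []
    for value in new:
        if value not in position:
            position[value] = len(prior) + len(delta)
            delta.append(value)
    remapped = [position[new[i]] for i in indices]
    return delta, remapped


kind, values = classify_dictionary_change(
    ["A", "B", "C"], ["A", "B", "C", "D", "E"])
delta, remapped = unify_dictionary(
    ["A", "B", "C"], ["C", "A", "D"], [0, 1, 2, 2])
```

The first call corresponds to the prefix case (a delta of the two new
values), the second to the permutation case where indices have to be
rewritten before a delta can be sent.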


On Tuesday, August 27, 2019, Wes McKinney <[email protected]> wrote:

> On Tue, Aug 27, 2019 at 3:55 PM Antoine Pitrou <[email protected]> wrote:
> >
> >
> > On 27/08/2019 at 22:31, Wes McKinney wrote:
> > > So the current situation we have right now in C++ is that if we tried
> > > to create an IPC stream from a sequence of record batches that don't
> > > all have the same dictionary, we'd run into two scenarios:
> > >
> > > * Batches whose dictionary is a prefix of a previously observed
> > > dictionary, or has the previously observed dictionary as a prefix. For
> > > example, suppose that the dictionary sent for an id was ['A', 'B', 'C']
> > > and then there's a subsequent batch with ['A', 'B', 'C', 'D', 'E']. In
> > > such a case we could compute and send a delta batch.
> > >
> > > * Batches with a dictionary that is a permutation of values, and
> > > possibly new unique values.
> > >
> > > In this latter case, without the option of replacing an existing ID in
> > > the stream, we would have to do a unification / permutation of indices
> > > and then also possibly send a delta batch. We should probably have
> > > code at some point that deals with both cases, but in the meantime I
> > > would like to allow dictionaries to be redefined in this case. Seems
> > > like we might need a vote to formalize this?
> >
> > Isn't the stream format deviating from the file format then?  In the
> > file format, IIUC, dictionaries can appear after the respective record
> > batches, so there's no way to tell whether the original or redefined
> > version of a dictionary is being referred to.
>
> You make a good point -- we can consider changes to the file format to
> allow for record batches to have different dictionaries. Even handling
> delta dictionaries with the current file format would be a bit tedious
> (though not indeterminate).
>
> > Regards
> >
> > Antoine.
>
