Re: [Format] Semantics for dictionary batches in streams

Micah Kornfield Thu, 29 Aug 2019 00:24:34 -0700

>
>
> > I was thinking the file format must satisfy one of two conditions:
> > 1.  Exactly one dictionarybatch per encoded column
> > 2.  DictionaryBatches are interleaved correctly.


> Could you clarify?

I think you clarified it very well :) My motivation for suggesting the
additional complexity is that I see two use-cases for the file format.  These
roughly correspond to the two options I suggested:
1.  We are encoding data from scratch.  In this case, it seems like all
dictionaries would be built incrementally and not need replacement, and we
would write them at the end of the file [1].

2.  The data being written out is essentially a "tee" off of some stream
that is generating new dictionaries requiring replacement on the fly (e.g.
reading back two Parquet files).
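
To make the two conditions concrete, here is a rough Python sketch of a
check a writer or reader might run over the messages in a file. The message
model (tuples of kind and dictionary ids) is hypothetical and not an Arrow
API, and the reading of "interleaved correctly" (every record batch is
preceded by a dictionary batch for each id it references) is my assumption.

```python
from collections import Counter

# Hypothetical message model, in file order:
#   ("dictionary", dict_id)            a dictionary batch for dict_id
#   ("record", [dict_id, ...])         a record batch and the ids it references

def satisfies_condition_1(messages):
    """Exactly one dictionary batch per encoded column (any position)."""
    counts = Counter(m[1] for m in messages if m[0] == "dictionary")
    used = {d for m in messages if m[0] == "record" for d in m[1]}
    return (all(counts[d] == 1 for d in used)
            and all(c == 1 for c in counts.values()))

def satisfies_condition_2(messages):
    """Assumed meaning of 'interleaved correctly': each record batch comes
    after at least one dictionary batch for every id it references."""
    seen = set()
    for kind, payload in messages:
        if kind == "dictionary":
            seen.add(payload)
        elif not set(payload) <= seen:
            return False
    return True

def file_layout_ok(messages):
    return satisfies_condition_1(messages) or satisfies_condition_2(messages)
```

Under this sketch, use-case 1 (dictionaries written once at the end) passes
via the first check, and use-case 2 (a tee with replacements) passes via the
second.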

> It might be better to disallow replacements
> in the file format (which does introduce semantic slippage between the
> file and stream formats as Antoine was saying).

It is certainly possible to accept the slippage from the stream format
for now and add this capability later, since it should be forwards
compatible.

Thanks,
Micah

[1] There is also a medium-complexity option where we require one non-delta
dictionary and allow as many delta dictionaries as the user wants.
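
A minimal sketch of how a reader could resolve a dictionary under that
medium-complexity option, assuming the batches for one dictionary id are
given as hypothetical (is_delta, values) pairs and that deltas are applied
in the order they appear in the file:

```python
def resolve_dictionary(batches):
    """Combine one non-delta dictionary with any number of deltas.

    batches: list of (is_delta, values) pairs for a single dictionary id,
    in file order. The pair layout is illustrative, not an Arrow structure.
    """
    base = [values for is_delta, values in batches if not is_delta]
    if len(base) != 1:
        raise ValueError("expected exactly one non-delta dictionary")
    resolved = list(base[0])
    # Deltas only ever append, so the position of the base does not matter.
    for is_delta, values in batches:
        if is_delta:
            resolved.extend(values)
    return resolved
```

Because deltas only append values, indices in earlier record batches stay
valid no matter where in the file the dictionary batches end up.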

On Wed, Aug 28, 2019 at 7:50 AM Wes McKinney <[email protected]> wrote:

> On Tue, Aug 27, 2019 at 6:05 PM Micah Kornfield <[email protected]>
> wrote:
> >
> > I was thinking the file format must satisfy one of two conditions:
> > 1.  Exactly one dictionarybatch per encoded column
> > 2.  DictionaryBatches are interleaved correctly.
>
> Could you clarify? In the first case, there is no issue with
> dictionary replacements. I'm not sure about the second case -- if a
> dictionary id appears twice, then you'll see it twice in the file
> footer. I suppose you could look at the file offsets to determine
> whether a dictionary batch precedes a particular record batch block
> (to know which dictionary you should be using), but that's rather
> complicated to implement. It might be better to disallow replacements
> in the file format (which does introduce semantic slippage between the
> file and stream formats as Antoine was saying).
>
> >
> > On Tuesday, August 27, 2019, Wes McKinney <[email protected]> wrote:
> >
> > > On Tue, Aug 27, 2019 at 3:55 PM Antoine Pitrou <[email protected]> wrote:
> > > >
> > > >
> > > > On 27/08/2019 at 22:31, Wes McKinney wrote:
> > > > > So the current situation we have right now in C++ is that if we
> > > > > tried to create an IPC stream from a sequence of record batches
> > > > > that don't all have the same dictionary, we'd run into two
> > > > > scenarios:
> > > > >
> > > > > * Batches that either have a prefix of a prior-observed
> > > > > dictionary, or the prior dictionary is a prefix of their
> > > > > dictionary. For example, suppose that the dictionary sent for an
> > > > > id was ['A', 'B', 'C'] and then there's a subsequent batch with
> > > > > ['A', 'B', 'C', 'D', 'E']. In such case we could compute and send
> > > > > a delta batch
> > > > >
> > > > > * Batches with a dictionary that is a permutation of values, and
> > > > > possibly new unique values.
> > > > >
> > > > > In this latter case, without the option of replacing an existing
> > > > > ID in the stream, we would have to do a unification / permutation
> > > > > of indices and then also possibly send a delta batch. We should
> > > > > probably have code at some point that deals with both cases, but
> > > > > in the meantime I would like to allow dictionaries to be redefined
> > > > > in this case. Seems like we might need a vote to formalize this?
> > > >
> > > > Isn't the stream format deviating from the file format then?  In the
> > > > file format, IIUC, dictionaries can appear after the respective
> > > > record batches, so there's no way to tell whether the original or
> > > > redefined version of a dictionary is being referred to.
> > >
> > > You make a good point -- we can consider changes to the file format to
> > > allow for record batches to have different dictionaries. Even handling
> > > delta dictionaries with the current file format would be a bit tedious
> > > (though not indeterminate)
> > >
> > > > Regards
> > > >
> > > > Antoine.
> > >
>
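
For reference, the two scenarios quoted above (prefix versus permutation)
can be sketched in plain Python. This is only an illustration under my own
assumptions: dictionaries are lists of values, the function names are
made up, and none of this reflects the actual C++ implementation.

```python
def classify_update(prior, new):
    """Decide what a stream writer would need to send for a dictionary id.

    Returns (action, values_to_send) where action is one of:
      "covered": new is a prefix of (or equal to) prior, nothing to send
      "delta":   prior is a prefix of new, send only the new tail
      "replace": anything else (e.g. a permutation), send the whole thing
    """
    if new == prior[:len(new)]:
        return ("covered", [])
    if prior == new[:len(prior)]:
        return ("delta", new[len(prior):])
    return ("replace", list(new))

def unify(prior, new, indices):
    """Alternative when replacement is not allowed: remap a batch's indices
    onto prior + delta, and return (remapped_indices, delta)."""
    position = {value: i for i, value in enumerate(prior)}
    delta = []
    for value in new:
        if value not in position:
            position[value] = len(prior) + len(delta)
            delta.append(value)
    return [position[new[i]] for i in indices], delta
```

With the example from the thread, classify_update(['A', 'B', 'C'],
['A', 'B', 'C', 'D', 'E']) yields a delta of ['D', 'E'], while a permuted
dictionary falls through to "replace" unless the writer chooses to unify.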
