Re: [Format] Semantics for dictionary batches in streams

Micah Kornfield Mon, 09 Sep 2019 20:06:28 -0700

Yes, I opened a JIRA. I'm going to try to make a proposal that consolidates
all the recent dictionary discussions.


On Mon, Sep 9, 2019 at 12:21 PM Wes McKinney <wesmck...@gmail.com> wrote:

> hi Micah,
>
> I think we should formulate changes to format/Columnar.rst and have a
> vote, what do you think?
>
> On Thu, Aug 29, 2019 at 2:23 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
> >>
> >>
> >> > I was thinking the file format must satisfy one of two conditions:
> >> > 1.  Exactly one DictionaryBatch per encoded column
> >> > 2.  DictionaryBatches are interleaved correctly.
> >>
> >> Could you clarify?
> >
> > I think you clarified it very well :) My motivation for suggesting the
> additional complexity is that I see two use cases for the file format.  These
> roughly correspond to the two options I suggested:
> > 1.  We are encoding data from scratch.  In this case, it seems like all
> dictionaries would be built incrementally and would not need replacement, and
> we would write them at the end of the file [1].
> >
> > 2.  The data being written out is essentially a "tee" off of some stream
> that is generating new dictionaries requiring replacement on the fly (e.g.
> reading back two Parquet files).
> >
> >>  It might be better to disallow replacements
> >> in the file format (which does introduce semantic slippage between the
> >> file and stream formats as Antoine was saying).
> >
> > It is certainly possible to accept the slippage from the stream
> format for now and add this capability later, since it should be forwards
> compatible.
> >
> > Thanks,
> > Micah
> >
> > [1] There is also a medium-complexity option where we require one
> non-delta dictionary and allow as many delta dictionaries as the user wants.
> >
> > On Wed, Aug 28, 2019 at 7:50 AM Wes McKinney <wesmck...@gmail.com>
> wrote:
> >>
> >> On Tue, Aug 27, 2019 at 6:05 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
> >> >
> >> > I was thinking the file format must satisfy one of two conditions:
> >> > 1.  Exactly one DictionaryBatch per encoded column
> >> > 2.  DictionaryBatches are interleaved correctly.
> >>
> >> Could you clarify? In the first case, there is no issue with
> >> dictionary replacements. I'm not sure about the second case -- if a
> >> dictionary id appears twice, then you'll see it twice in the file
> >> footer. I suppose you could look at the file offsets to determine
> >> whether a dictionary batch precedes a particular record batch block
> >> (to know which dictionary you should be using), but that's rather
> >> complicated to implement. It might be better to disallow replacements
> >> in the file format (which does introduce semantic slippage between the
> >> file and stream formats as Antoine was saying).
> >>
> >> >
> >> > On Tuesday, August 27, 2019, Wes McKinney <wesmck...@gmail.com>
> wrote:
> >> >
> >> > > On Tue, Aug 27, 2019 at 3:55 PM Antoine Pitrou <anto...@python.org>
> wrote:
> >> > > >
> >> > > >
> >> > > > On 27/08/2019 at 22:31, Wes McKinney wrote:
> >> > > > > So the current situation we have right now in C++ is that if we
> tried
> >> > > > > to create an IPC stream from a sequence of record batches that
> don't
> >> > > > > all have the same dictionary, we'd run into two scenarios:
> >> > > > >
> >> > > > > * Batches that either have a prefix of a prior-observed
> dictionary, or
> >> > > > > the prior dictionary is a prefix of their dictionary. For
> example,
> >> > > > > suppose that the dictionary sent for an id was ['A', 'B', 'C']
> and
> >> > > > > then there's a subsequent batch with ['A', 'B', 'C', 'D', 'E'].
In
> >> > > > > such a case we could compute and send a delta batch.
> >> > > > >
> >> > > > > * Batches with a dictionary that is a permutation of values, and
> >> > > > > possibly new unique values.
> >> > > > >
> >> > > > > In this latter case, without the option of replacing an
> existing ID in
> >> > > > > the stream, we would have to do a unification / permutation of
> indices
> >> > > > > and then also possibly send a delta batch. We should probably
> have
> >> > > > > code at some point that deals with both cases, but in the
> meantime I
> >> > > > > would like to allow dictionaries to be redefined in this case.
> Seems
> >> > > > > like we might need a vote to formalize this?
> >> > > >
> >> > > > Isn't the stream format deviating from the file format then?  In
> the
> >> > > > file format, IIUC, dictionaries can appear after the respective
> record
> >> > > > batches, so there's no way to tell whether the original or
> redefined
> >> > > > version of a dictionary is being referred to.
> >> > >
> >> > > You make a good point -- we can consider changes to the file format
> to
> >> > > allow for record batches to have different dictionaries. Even
> handling
> >> > > delta dictionaries with the current file format would be a bit
> tedious
> >> > > (though not indeterminate)
> >> > >
> >> > > > Regards
> >> > > >
> >> > > > Antoine.
> >> > >
>
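
The two scenarios Wes describes in the quoted thread (a dictionary that
extends a previously sent one, versus a permuted or otherwise changed
dictionary) can be told apart with a simple prefix check. The following is a
minimal Python sketch under stated assumptions, not the Arrow C++
implementation: dictionaries are modeled as plain lists of values, and the
names classify_dictionary_change and delta_values are hypothetical.

```python
from typing import List, Sequence


def classify_dictionary_change(prior: Sequence, new: Sequence) -> str:
    """Classify how `new` relates to the `prior` dictionary sent for an id.

    Hypothetical helper for illustration only; dictionaries are plain lists.
    Returns one of: "unchanged", "delta", "prefix", "replacement".
    """
    prior, new = list(prior), list(new)
    if new == prior:
        # Same dictionary: nothing needs to be sent.
        return "unchanged"
    if len(new) > len(prior) and new[: len(prior)] == prior:
        # The prior dictionary is a prefix of the new one:
        # only the appended values need to go out, as a delta batch.
        return "delta"
    if len(new) < len(prior) and prior[: len(new)] == new:
        # The new dictionary is a prefix of the prior one:
        # existing indices remain valid against the prior dictionary.
        return "prefix"
    # Permuted or otherwise different values: either replace the
    # dictionary for this id, or unify and remap the indices.
    return "replacement"


def delta_values(prior: Sequence, new: Sequence) -> List:
    """Values to send in a delta batch; assumes the "delta" case above."""
    return list(new)[len(prior):]
```

With the example from the thread, classify_dictionary_change(['A', 'B', 'C'],
['A', 'B', 'C', 'D', 'E']) falls into the "delta" case and delta_values yields
the two appended values, while a permuted dictionary such as
['C', 'A', 'B', 'D'] falls into the "replacement" case.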
