Re: [DISCUSS] Dictionary Encoding Clarifications/Future Proofing

Wes McKinney Tue, 12 Nov 2019 14:02:01 -0800

Hi Antoine,

Each *record batch* is intended to be readable in random order. Reading any
record batch requires loading the dictionaries indicated in the schema, so
appending the deltas as part of this process does not seem like it would
introduce hardship, given that such logic is needed to properly handle the
stream format. Dictionary replacements in the file format (at least as
currently conceived) do not seem possible.
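
To make the loading step concrete, here is a rough sketch in plain Python.
It is not the Arrow reader API: DictionaryBatch, load_file_dictionaries and
decode are hypothetical names invented for this illustration, and the
dictionary values are toy strings. It assumes dictionary batches are applied
in the order they are listed, that a delta appends to an existing dictionary,
and that a second non-delta batch for the same ID is rejected in a file.

```python
# Hypothetical message model for illustration; not the Arrow IPC API.
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class DictionaryBatch:
    dict_id: int
    values: List[str]
    is_delta: bool = False


def load_file_dictionaries(batches: List[DictionaryBatch]) -> Dict[int, List[str]]:
    """Resolve every dictionary up front, applying batches in the given order."""
    resolved: Dict[int, List[str]] = {}
    for batch in batches:
        if batch.is_delta:
            if batch.dict_id not in resolved:
                raise ValueError(f"delta for unknown dictionary {batch.dict_id}")
            # A delta only appends, so existing indices keep their meaning.
            resolved[batch.dict_id].extend(batch.values)
        elif batch.dict_id in resolved:
            # A second non-delta batch would be a replacement, which is
            # assumed here to be disallowed in the file format.
            raise ValueError(f"dictionary {batch.dict_id} replaced in file")
        else:
            resolved[batch.dict_id] = list(batch.values)
    return resolved


def decode(indices: List[Optional[int]], dictionary: List[str]) -> List[Optional[str]]:
    """Map dictionary indices back to values; None stands for a null slot."""
    return [None if i is None else dictionary[i] for i in indices]


dictionaries = load_file_dictionaries([
    DictionaryBatch(0, ["a", "b"]),
    DictionaryBatch(0, ["c"], is_delta=True),
])
# Batches can now be decoded in any order against the fully loaded dictionary.
later_batch = decode([2, 0], dictionaries[0])        # written after the delta
first_batch = decode([0, 1, None], dictionaries[0])  # written before the delta
```

The batch written before the delta decodes to the same values even though
the dictionary it sees is larger than when it was written, which is why
random access to record batches still works.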



On Tue, Nov 12, 2019, 10:13 AM Antoine Pitrou <[email protected]> wrote:

>
> Hi,
>
> Sorry for the delay.
>
> My high-level question is the following:  is the file format intended to
> be readable in random order (rather than having to read through it in
> sequence as with the stream format)?  If so, how does this play with the
> fact that there potentially are delta dictionaries in the "stream"?
>
> Regards
>
> Antoine.
>
>
> Le 30/10/2019 à 21:11, Wes McKinney a écrit :
> > Returning to this discussion as there seems to be a lack of consensus in
> > the vote thread.
> >
> > Copying Micah's proposals in the VOTE thread here, I wanted to state
> > my opinions so we can discuss further and see where there is potential
> > disagreement
> >
> > 1.  It is not required that all dictionary batches occur at the beginning
> > of the IPC stream format (if the first record batch has an all-null
> > dictionary-encoded column, the null column's dictionary might not be sent
> > until later in the stream).
> >
> > This seems preferable to requiring a placeholder empty dictionary
> > batch. This does mean more to test but the integration tests will
> > force the issue
> >
> > 2.  A second dictionary batch for the same ID that is not a "delta batch"
> > in an IPC stream indicates the dictionary should be replaced.
> >
> > Agree.
> >
> > 3.  Clarifies that the file format can only contain 1 "NON-delta"
> > dictionary batch and multiple "delta" dictionary batches.
> >
> > Agree -- it is also worth stating explicitly that dictionary
> > replacements are not allowed in the file format.
> >
> > In the file format, all the dictionaries must be "loaded" up front.
> > The code path for loading the dictionaries ideally should use nearly
> > the same code as the stream-reader code that sees follow-up dictionary
> > batches interspersed in the stream. The only downside is that it will
> > not be possible to exactly preserve the dictionary "state" as of each
> > record batch being written.
> >
> > So if we had a file containing
> >
> > DICTIONARY ID=0
> > RECORD BATCH
> > RECORD BATCH
> > DICTIONARY DELTA ID=0
> > RECORD BATCH
> > RECORD BATCH
> >
> > Then after processing/loading the dictionaries, the first two record
> > batches will have a dictionary that is "larger" (on account of the
> > delta) than when they were written. Since dictionaries are
> > fundamentally about data representation, they still represent the same
> > data so I think this is acceptable.
> >
> > 4.  Add an enum to dictionary metadata for possible future changes in what
> > format dictionary batches can be sent. (the most likely would be an array
> > Map<Int, Value>).  An enum is needed as a place holder to allow for forward
> > compatibility past the release 1.0.0.
> >
> > I'm least sure about this but I do not think it is harmful to have a
> > forward-compatible "escape hatch" for future evolutions in dictionary
> > encoding.
> >
> > On Wed, Oct 16, 2019 at 2:57 AM Micah Kornfield <[email protected]> wrote:
> >>
> >> I'll plan on starting a vote in the next day or two if there are no
> >> further objections/comments.
> >>
> >> On Sun, Oct 13, 2019 at 11:06 AM Micah Kornfield <[email protected]> wrote:
> >>
> >>> The only point raised on the PR that I think is worth discussing is
> >>> assumptions about dictionaries at the beginning of streams.
> >>>
> >>> There are two options:
> >>> 1.  Based on the current wording, it does not seem that all dictionaries
> >>> need to be at the beginning of the stream if they aren't made use of in
> >>> the first record batch (i.e. a dictionary encoded column is all null in
> >>> the first record batch).
> >>> 2.  We require a dictionary batch for each dictionary at the beginning of
> >>> the stream (and require implementations to send an empty batch if they
> >>> don't have the dictionary available).
> >>>
> >>> The current proposal in the PR is option #1.
> >>>
> >>> Thanks,
> >>> Micah
> >>>
> >>> On Sat, Oct 5, 2019 at 4:01 PM Micah Kornfield <[email protected]> wrote:
> >>>
> >>>> I've opened a pull request [1] to clarify some recent conversations
> >>>> about semantics/edge cases for dictionary encoding [2][3] around
> >>>> interleaved batches and when isDelta=False.
> >>>>
> >>>> Specifically, it proposes that isDelta=False indicates dictionary
> >>>> replacement.  For the file format, only one isDelta=False batch is
> >>>> allowed per file, and isDelta=true batches are applied in the order
> >>>> supplied in the file footer.
> >>>>
> >>>> In addition, I've added a new enum to DictionaryEncoding to preserve
> >>>> future compatibility in case we want to expand dictionary encoding to
> >>>> be an explicit mapping from "ID" to "VALUE" as discussed in [4].
> >>>>
> >>>> Once people have had a chance to review and come to a consensus, I
> >>>> will call a formal vote to approve and commit the change.
> >>>>
> >>>> Thanks,
> >>>> Micah
> >>>>
> >>>> [1] https://github.com/apache/arrow/pull/5585
> >>>> [2] https://lists.apache.org/thread.html/9734b71bc12aca16eb997388e95105bff412fdaefa4e19422f477389@%3Cdev.arrow.apache.org%3E
> >>>> [3] https://lists.apache.org/thread.html/5c3c9346101df8d758e24664638e8ada0211d310ab756a89cde3786a@%3Cdev.arrow.apache.org%3E
> >>>> [4] https://lists.apache.org/thread.html/15a4810589b2eb772bce5b2372970d9d93badbd28999a1bbe2af418a@%3Cdev.arrow.apache.org%3E
> >>>>
> >>>>
>
