I think the wording, at least, was confusing, since you raised questions on
the PR and Antoine commented here.

I agree with your analysis that it probably would not be hard to support.
I don't feel strongly either way on this particular point, aside from
wanting to reach a resolution. If I had to choose, I'd prefer allowing delta
dictionaries in files.
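
To make the replay argument in the quoted message concrete, here is a minimal
sketch in plain Python. It does not use Arrow's actual IPC API: the
DictionaryMessage and BatchMessage classes and the decode_* helpers are
hypothetical names invented for illustration, and a single dictionary-encoded
string column with dictionary ID 0 is assumed. It only shows that, when deltas
append and never redefine, decoding with the fully accumulated dictionary
gives the same values as decoding in stream order.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class DictionaryMessage:
    """Hypothetical stand-in for a dictionary batch (not an Arrow class)."""
    dict_id: int
    values: List[str]
    is_delta: bool


@dataclass
class BatchMessage:
    """Hypothetical record batch: one column of indices into dictionary 0."""
    indices: List[int]


Message = Union[DictionaryMessage, BatchMessage]


def apply_dictionary(state: Dict[int, List[str]], msg: DictionaryMessage) -> None:
    """Update dictionary state: a delta appends, a non-delta replaces."""
    if msg.is_delta:
        if msg.dict_id not in state:
            raise ValueError("delta dictionary seen before its base dictionary")
        state[msg.dict_id] = state[msg.dict_id] + msg.values
    else:
        state[msg.dict_id] = list(msg.values)


def decode_in_stream_order(messages: List[Message]) -> List[List[str]]:
    """Decode each batch with the dictionary as it stood when it was written."""
    state: Dict[int, List[str]] = {}
    decoded = []
    for msg in messages:
        if isinstance(msg, DictionaryMessage):
            apply_dictionary(state, msg)
        else:
            decoded.append([state[0][i] for i in msg.indices])
    return decoded


def decode_after_full_replay(messages: List[Message]) -> List[List[str]]:
    """Reconstruct every dictionary first, then decode all batches."""
    state: Dict[int, List[str]] = {}
    for msg in messages:
        if isinstance(msg, DictionaryMessage):
            apply_dictionary(state, msg)
    return [
        [state[0][i] for i in msg.indices]
        for msg in messages
        if isinstance(msg, BatchMessage)
    ]


# Same layout as the example in the quoted message below.
messages: List[Message] = [
    DictionaryMessage(dict_id=0, values=["a", "b"], is_delta=False),
    BatchMessage(indices=[0, 1, 0]),   # BATCH 0
    BatchMessage(indices=[1, 1]),      # BATCH 1
    BatchMessage(indices=[0]),         # BATCH 2
    DictionaryMessage(dict_id=0, values=["c"], is_delta=True),
    BatchMessage(indices=[2, 0]),      # BATCH 3
    DictionaryMessage(dict_id=0, values=["d"], is_delta=True),
    BatchMessage(indices=[3, 2, 1]),   # BATCH 4
]

stream_result = decode_in_stream_order(messages)
file_result = decode_after_full_replay(messages)
```

Under these assumptions both paths yield the same decoded values. The
equivalence relies on deltas only appending; a non-delta redefinition in the
middle of the file would break it, which is the case Antoine raised.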

On Friday, October 25, 2019, Wes McKinney <wesmck...@gmail.com> wrote:

> Can we discuss the delta dictionary issue a bit more? I admit I don't
> share the same concerns.
>
> From the perspective of a file and stream producer, the code paths
> should be nearly identical. The differences with the file format are:
>
> * Magic numbers to detect that it is the "file format"
> * Accumulated metadata at the footer
>
> If a file has any dictionaries at all, then they must all be
> reconstructed before reading a record batch. So let's say we have a
> file like
>
> DICTIONARY ID=0, isDelta=FALSE
> BATCH 0
> BATCH 1
> BATCH 2
> DICTIONARY ID=0, isDelta=TRUE
> BATCH 3
> DICTIONARY ID=0, isDelta=TRUE
> BATCH 4
>
> I do not see any harm in this -- the only downside is that you won't
> know what "state" the dictionary was in for the first 3 batches.
> Viewing dictionary encoding strictly as a data representation method,
> batches 0-2 and 3 represent the same data even if their in-memory
> dictionaries are larger than they were at the moment the batches
> were written.
>
> Note that the code path for "processing" the dictionaries as a first
> step will use the same code as the stream path. It should not be a
> great deal of work to write test cases for this.
>
> On Thu, Oct 24, 2019 at 11:06 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
> >
> > Hi Antoine,
> > There is a defined order for dictionaries in metadata. What isn't well
> > defined is the relative ordering between record batches and delta
> > dictionaries.
> >
> > However, this point seems confusing. I can't think of a real-world use
> > case where it would be valuable enough to include, so I will remove delta
> > dictionaries.
> >
> > So let's cancel this vote and I'll start a new one after the update.
> >
> > Thanks,
> > Micah
> >
> > On Thursday, October 24, 2019, Antoine Pitrou <anto...@python.org> wrote:
> >
> > >
> > > On 24/10/2019 at 04:39, Micah Kornfield wrote:
> > > >
> > > > 3.  Clarifies that the file format can only contain 1 "NON-delta"
> > > > dictionary batch and multiple "delta" dictionary batches.
> > >
> > > This is a bit weird.  If the file format can carry delta dictionaries,
> > > it means order is significant, so it may as well contain dictionary
> > > redefinitions.
> > >
> > > If the file format is meant to be truly readable in random order, then
> > > it should also forbid delta dictionaries.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
>
