Can we discuss the delta dictionary issue a bit more? I admit I don't
share the same concerns.

From the perspective of a file and stream producer, the code paths
should be nearly identical. The differences with the file format are:

* Magic numbers to detect that it is the "file format"
* Accumulated metadata at the footer

If a file has any dictionaries at all, then they must all be
reconstructed before reading a record batch. So let's say we have a
file like

DICTIONARY ID=0, isDelta=FALSE
BATCH 0
BATCH 1
BATCH 2
DICTIONARY ID=0, isDelta=TRUE
BATCH 3
DICTIONARY ID=0, isDelta=TRUE
BATCH 4

I do not see any harm in this -- the only downside is that you won't
know what "state" the dictionary was in for the first 3 batches.
Viewing dictionary encoding strictly as a data representation method,
batches 0-2 and 3 represent the same data even if their in-memory
dictionaries are larger than they were at the moment the batches
were written.

Note that the code path for "processing" the dictionaries as a first
step will use the same code as the stream path. It should not be a
great deal of work to write test cases for this.
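
To make the argument concrete, here is a rough sketch in Python. It is
a toy model, not the Arrow IPC API: the message tuples and the names
apply_dictionary, read_stream and read_file are made up for
illustration, and it assumes a single dictionary-encoded column. The
point is only that the file path can replay all dictionary messages
first, using the same function as the stream path, and still decode
every batch to the same values.

```python
# Toy model of the layout above. Hypothetical names, not the Arrow API.
# A message is ("DICTIONARY", dict_id, is_delta, values) or ("BATCH", indices).
MESSAGES = [
    ("DICTIONARY", 0, False, ["a", "b"]),
    ("BATCH", [0, 1]),       # batch 0
    ("BATCH", [1, 1]),       # batch 1
    ("BATCH", [0, 0]),       # batch 2
    ("DICTIONARY", 0, True, ["c"]),
    ("BATCH", [2, 0]),       # batch 3
    ("DICTIONARY", 0, True, ["d"]),
    ("BATCH", [3, 2]),       # batch 4
]


def apply_dictionary(dictionaries, dict_id, is_delta, values):
    """Shared by both readers: a delta appends, a non-delta replaces."""
    if is_delta:
        dictionaries.setdefault(dict_id, []).extend(values)
    else:
        dictionaries[dict_id] = list(values)


def decode(dictionaries, indices, dict_id=0):
    """Map dictionary indices back to values (single column assumed)."""
    return [dictionaries[dict_id][i] for i in indices]


def read_stream(messages):
    """Stream path: process messages strictly in order."""
    dictionaries, decoded = {}, []
    for msg in messages:
        if msg[0] == "DICTIONARY":
            apply_dictionary(dictionaries, *msg[1:])
        else:
            decoded.append(decode(dictionaries, msg[1]))
    return decoded


def read_file(messages):
    """File path: rebuild all dictionaries first, then read any batch."""
    dictionaries = {}
    for msg in messages:
        if msg[0] == "DICTIONARY":
            apply_dictionary(dictionaries, *msg[1:])
    batches = [msg[1] for msg in messages if msg[0] == "BATCH"]
    # Batches could be decoded in any order here; list order is kept
    # only to make the comparison with the stream path easy.
    return [decode(dictionaries, indices) for indices in batches]


# Deltas only append, so old indices keep their meaning.
assert read_stream(MESSAGES) == read_file(MESSAGES)
```

The equivalence holds because a delta only appends entries, so an
index written against the smaller dictionary still points at the same
value in the larger one. It would not hold if a non-delta dictionary
with the same ID appeared in the middle of the file.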

On Thu, Oct 24, 2019 at 11:06 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> Hi Antoine,
> There is a defined order for dictionaries in metadata.  What isn't well
> defined is relative ordering between record batches and Delta dictionaries.
>
>  However, this point seems confusing. I can't think of a real-world use
> case where it would be valuable enough to include, so I will remove Delta
> dictionaries.
>
> So let's cancel this vote and I'll start a new one after the update.
>
> Thanks,
> Micah
>
> On Thursday, October 24, 2019, Antoine Pitrou <anto...@python.org> wrote:
>
> >
> > Le 24/10/2019 à 04:39, Micah Kornfield a écrit :
> > >
> > > 3.  Clarifies that the file format can only contain 1 "NON-delta"
> > > dictionary batch and multiple "delta" dictionary batches.
> >
> > This is a bit weird.  If the file format can carry delta dictionaries,
> > it means order is significant, so it may as well contain dictionary
> > redefinitions.
> >
> > If the file format is meant to be truly readable in random order, then
> > it should also forbid delta dictionaries.
> >
> > Regards
> >
> > Antoine.
> >
