Re: [VOTE] Clarifications and forward compatibility changes for Dictionary Encoding

Wes McKinney Wed, 30 Oct 2019 13:15:34 -0700

I replied on the original DISCUSS thread. I believe Antoine is
unavailable this week, but hopefully we can drive the discussion to a
consensus point next week so we can vote.


On Sat, Oct 26, 2019 at 12:01 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> I think at least the wording was confusing because you raised questions on 
> the PR and Antoine commented here.
>
> I agree with your analysis that it probably would not be hard to support.
> But I don't feel too strongly either way on this particular point, aside from
> coming to a resolution. If I had to choose, I'd prefer allowing Delta
> dictionaries in files.
>
> On Friday, October 25, 2019, Wes McKinney <wesmck...@gmail.com> wrote:
>>
>> Can we discuss the delta dictionary issue a bit more? I admit I don't
>> share those same concerns.
>>
>> From the perspective of a file and stream producer, the code paths
>> should be nearly identical. The differences with the file format are:
>>
>> * Magic numbers to detect that it is the "file format"
>> * Accumulated metadata at the footer
>>
>> If a file has any dictionaries at all, then they must all be
>> reconstructed before reading a record batch. So let's say we have a
>> file like
>>
>> DICTIONARY ID=0, isDelta=FALSE
>> BATCH 0
>> BATCH 1
>> BATCH 2
>> DICTIONARY ID=0, isDelta=TRUE
>> BATCH 3
>> DICTIONARY ID=0, isDelta=TRUE
>> BATCH 4
>>
>> I do not see any harm in this -- the only downside is that you won't
>> know what "state" the dictionary was in for the first 3 batches.
>> Viewing dictionary encoding strictly as a data representation method,
>> batches 0-2 and 3 represent the same data even if their in-memory
>> dictionaries are larger than they were at the moment they were
>> written.
>>
>> Note that the code path for "processing" the dictionaries as a first
>> step will use the same code as the stream path. It should not be a
>> great deal of work to write test cases for this.
>>
>> On Thu, Oct 24, 2019 at 11:06 AM Micah Kornfield <emkornfi...@gmail.com> 
>> wrote:
>> >
>> > Hi Antoine,
>> > There is a defined order for dictionaries in metadata.  What isn't well
>> > defined is relative ordering between record batches and Delta dictionaries.
>> >
>> > However, this point seems confusing. I can't think of a real-world use
>> > case where it would be valuable enough to include, so I will remove Delta
>> > dictionaries.
>> >
>> > So let's cancel this vote and I'll start a new one after the update.
>> >
>> > Thanks,
>> > Micah
>> >
>> > On Thursday, October 24, 2019, Antoine Pitrou <anto...@python.org> wrote:
>> >
>> > >
>> > > On 24/10/2019 at 04:39, Micah Kornfield wrote:
>> > > >
>> > > > 3.  Clarifies that the file format can only contain 1 "NON-delta"
>> > > > dictionary batch and multiple "delta" dictionary batches.
>> > >
>> > > This is a bit weird.  If the file format can carry delta dictionaries,
>> > > it means order is significant, so it may as well contain dictionary
>> > > redefinitions.
>> > >
>> > > If the file format is meant to be truly readable in random order, then
>> > > it should also forbid delta dictionaries.
>> > >
>> > > Regards
>> > >
>> > > Antoine.
>> > >
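
The file layout Wes describes earlier in the thread (one non-delta dictionary, then batches interleaved with delta dictionaries) can be sketched roughly as follows. This is a toy model in plain Python, not the Arrow IPC reader: the message tuples, the dictionary values, and the helper names `build_dictionaries` and `decode_batch` are all made up for illustration. It assumes, as the thread does, that a delta only appends values, so indices written earlier remain valid.

```python
# Toy model of the file layout from the thread; NOT the Arrow IPC API.
# Message shapes, values, and helper names are hypothetical.

MESSAGES = [
    ("DICTIONARY", {"id": 0, "is_delta": False, "values": ["a", "b"]}),
    ("BATCH", {"dict_id": 0, "indices": [0, 1, 0]}),   # batch 0
    ("BATCH", {"dict_id": 0, "indices": [1, 1]}),      # batch 1
    ("BATCH", {"dict_id": 0, "indices": [0]}),         # batch 2
    ("DICTIONARY", {"id": 0, "is_delta": True, "values": ["c"]}),
    ("BATCH", {"dict_id": 0, "indices": [2, 0]}),      # batch 3
    ("DICTIONARY", {"id": 0, "is_delta": True, "values": ["d"]}),
    ("BATCH", {"dict_id": 0, "indices": [3, 2, 1]}),   # batch 4
]


def build_dictionaries(messages):
    """Apply every dictionary message in file order, as a stream reader would."""
    dictionaries = {}
    for kind, body in messages:
        if kind != "DICTIONARY":
            continue
        if body["is_delta"]:
            # A delta appends to the existing dictionary; old indices stay valid.
            dictionaries[body["id"]] = dictionaries[body["id"]] + body["values"]
        else:
            dictionaries[body["id"]] = list(body["values"])
    return dictionaries


def decode_batch(batch, dictionaries):
    """Map a batch's indices back to dictionary values."""
    values = dictionaries[batch["dict_id"]]
    return [values[i] for i in batch["indices"]]


batches = [body for kind, body in MESSAGES if kind == "BATCH"]

# Reconstruct all dictionaries first, then read batches in any order.
final_dicts = build_dictionaries(MESSAGES)
decoded_3 = decode_batch(batches[3], final_dicts)
decoded_0 = decode_batch(batches[0], final_dicts)

# Batch 0 decodes to the same values it had when only ["a", "b"] existed.
dicts_at_write_time = build_dictionaries(MESSAGES[:1])
assert decoded_0 == decode_batch(batches[0], dicts_at_write_time)
```

In this sketch, the only thing lost by building the full dictionary up front is the knowledge of how large the dictionary was when batches 0-2 were written, which matches the downside noted in the thread. A dictionary redefinition (a second non-delta message for the same id) would break this property, since earlier indices could then point at different values.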
