Can we discuss the delta dictionary issue a bit more? I admit I don't share the same concerns.

From the perspective of a file and stream producer, the code paths
should be nearly identical. The differences with the file format are:

* Magic numbers to detect that it is the "file format"
* Accumulated metadata at the footer

If a file has any dictionaries at all, then they must all be
reconstructed before reading a record batch. So let's say we have a
file like

DICTIONARY ID=0, isDelta=FALSE
BATCH 0
BATCH 1
BATCH 2
DICTIONARY ID=0, isDelta=TRUE
BATCH 3
DICTIONARY ID=0, isDelta=TRUE
BATCH 4

I do not see any harm in this -- the only downside is that you won't
know what "state" the dictionary was in for the first 3 batches.
Viewing dictionary encoding strictly as a data representation method,
batches 0-2 and 3 still represent the same data even if their in-memory
dictionaries are larger than they were at the moment the batches were
written.

Note that the code path for "processing" the dictionaries as a first
step will use the same code as the stream path. It should not be a
great deal of work to write test cases for this.

On Thu, Oct 24, 2019 at 11:06 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> Hi Antoine,
> There is a defined order for dictionaries in metadata. What isn't well
> defined is relative ordering between record batches and Delta dictionaries.
>
> However, this point seems confusing. I can't think of a real-world use
> case where it would be valuable enough to include, so I will remove Delta
> dictionaries.
>
> So let's cancel this vote and I'll start a new one after the update.
>
> Thanks,
> Micah
>
> On Thursday, October 24, 2019, Antoine Pitrou <anto...@python.org> wrote:
>
> >
> > On 24/10/2019 at 04:39, Micah Kornfield wrote:
> > >
> > > 3. Clarifies that the file format, can only contain 1 "NON-delta"
> > > dictionary batch and multiple "delta" dictionary batches.
> >
> > This is a bit weird. If the file format can carry delta dictionaries,
> > it means order is significant, so it may as well contain dictionary
> > redefinitions.
> >
> > If the file format is meant to be truly readable in random order, then
> > it should also forbid delta dictionaries.
> >
> > Regards
> >
> > Antoine.
> >
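
To make the point about reconstructing all dictionaries before reading
any record batch concrete, here is a minimal Python sketch of the file
layout from my reply. It does not use the Arrow libraries: the
tuple-based message model and the names build_dictionaries and
decode_batches are hypothetical, and the dictionary values and indices
are made up for illustration.

```python
# Hypothetical message model (NOT the Arrow IPC API):
#   ("DICTIONARY", dict_id, is_delta, values)
#   ("BATCH", dict_id, indices)
FILE = [
    ("DICTIONARY", 0, False, ["a", "b"]),
    ("BATCH", 0, [0, 1, 0]),   # BATCH 0
    ("BATCH", 0, [1, 1]),      # BATCH 1
    ("BATCH", 0, [0]),         # BATCH 2
    ("DICTIONARY", 0, True, ["c"]),
    ("BATCH", 0, [2, 0]),      # BATCH 3
    ("DICTIONARY", 0, True, ["d"]),
    ("BATCH", 0, [3, 2, 1]),   # BATCH 4
]


def build_dictionaries(messages):
    """First pass: fold every dictionary message into its final state."""
    dictionaries = {}
    for msg in messages:
        if msg[0] != "DICTIONARY":
            continue
        _, dict_id, is_delta, values = msg
        if is_delta:
            # A delta appends to the existing dictionary.
            dictionaries[dict_id] = dictionaries[dict_id] + list(values)
        else:
            # A non-delta message defines the dictionary.
            dictionaries[dict_id] = list(values)
    return dictionaries


def decode_batches(messages):
    """Second pass: decode every batch against the final dictionaries."""
    dictionaries = build_dictionaries(messages)
    decoded = []
    for msg in messages:
        if msg[0] == "BATCH":
            _, dict_id, indices = msg
            values = dictionaries[dict_id]
            decoded.append([values[i] for i in indices])
    return decoded


final_dictionaries = build_dictionaries(FILE)
decoded_batches = decode_batches(FILE)
```

Because deltas only append, the indices written for batches 0-2 still
point at the same values in the final, larger dictionary, so decoding
them after the fact yields the same data as at write time.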