I think at least the wording was confusing because you raised questions on the PR and Antoine commented here.
I agree with your analysis that it probably would not be hard to support,
but I don't feel strongly either way on this particular point, aside from
wanting to come to a resolution. If I had to choose, I'd prefer allowing
Delta dictionaries in files.

On Friday, October 25, 2019, Wes McKinney <wesmck...@gmail.com> wrote:

> Can we discuss the delta dictionary issue a bit more? I admit I don't
> share the same concerns.
>
> From the perspective of a file and stream producer, the code paths
> should be nearly identical. The differences with the file format are:
>
> * Magic numbers to detect that it is the "file format"
> * Accumulated metadata at the footer
>
> If a file has any dictionaries at all, then they must all be
> reconstructed before reading a record batch. So let's say we have a
> file like
>
> DICTIONARY ID=0, isDelta=FALSE
> BATCH 0
> BATCH 1
> BATCH 2
> DICTIONARY ID=0, isDelta=TRUE
> BATCH 3
> DICTIONARY ID=0, isDelta=TRUE
> BATCH 4
>
> I do not see any harm in this -- the only downside is that you won't
> know what "state" the dictionary was in for the first 3 batches.
> Viewing dictionary encoding strictly as a data representation method,
> batches 0-2 and 3 represent the same data even if their in-memory
> dictionaries are larger than they were at the moment they were
> written.
>
> Note that the code path for "processing" the dictionaries as a first
> step will use the same code as the stream path. It should not be a
> great deal of work to write test cases for this.
>
> On Thu, Oct 24, 2019 at 11:06 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> >
> > Hi Antoine,
> > There is a defined order for dictionaries in metadata. What isn't well
> > defined is the relative ordering between record batches and Delta
> > dictionaries.
> >
> > However, this point seems confusing. I can't think of a real-world use
> > case where it would be valuable enough to include, so I will remove
> > Delta dictionaries.
> >
> > So let's cancel this vote and I'll start a new one after the update.
> >
> > Thanks,
> > Micah
> >
> > On Thursday, October 24, 2019, Antoine Pitrou <anto...@python.org> wrote:
> > >
> > > On 24/10/2019 at 04:39, Micah Kornfield wrote:
> > > >
> > > > 3. Clarifies that the file format can only contain 1 "NON-delta"
> > > > dictionary batch and multiple "delta" dictionary batches.
> > >
> > > This is a bit weird. If the file format can carry delta dictionaries,
> > > it means order is significant, so it may as well contain dictionary
> > > redefinitions.
> > >
> > > If the file format is meant to be truly readable in random order, then
> > > it should also forbid delta dictionaries.
> > >
> > > Regards
> > >
> > > Antoine.
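
For reference, the file layout in Wes's example can be modelled with a small
sketch. This is plain Python, not the Arrow API: DictionaryBatch, RecordBatch,
read_stream and read_file are hypothetical names invented for illustration,
and the dictionary values are made up. It only shows why decoding after all
deltas have been applied gives the same values as decoding in stream order,
assuming deltas strictly append to the dictionary.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DictionaryBatch:  # hypothetical stand-in, not an Arrow class
    dict_id: int
    values: List[str]
    is_delta: bool


@dataclass
class RecordBatch:  # hypothetical: one dictionary-encoded column
    dict_id: int
    indices: List[int]


def apply_dictionary(state: Dict[int, List[str]], msg: DictionaryBatch) -> None:
    """Install a new dictionary, or append the delta to an existing one."""
    if msg.is_delta:
        if msg.dict_id not in state:
            raise ValueError("delta received before the initial dictionary")
        state[msg.dict_id].extend(msg.values)
    else:
        state[msg.dict_id] = list(msg.values)


def decode(state: Dict[int, List[str]], batch: RecordBatch) -> List[str]:
    return [state[batch.dict_id][i] for i in batch.indices]


def read_stream(messages) -> List[List[str]]:
    """Stream order: decode each batch with the dictionary as of that point."""
    state: Dict[int, List[str]] = {}
    decoded = []
    for msg in messages:
        if isinstance(msg, DictionaryBatch):
            apply_dictionary(state, msg)
        else:
            decoded.append(decode(state, msg))
    return decoded


def read_file(messages) -> List[List[str]]:
    """File order: rebuild every dictionary first, then decode the batches."""
    state: Dict[int, List[str]] = {}
    for msg in messages:
        if isinstance(msg, DictionaryBatch):
            apply_dictionary(state, msg)
    return [decode(state, m) for m in messages if isinstance(m, RecordBatch)]


# The layout from the example above, with made-up values.
messages = [
    DictionaryBatch(0, ["a", "b"], is_delta=False),
    RecordBatch(0, [0, 1]),   # BATCH 0
    RecordBatch(0, [1, 1]),   # BATCH 1
    RecordBatch(0, [0]),      # BATCH 2
    DictionaryBatch(0, ["c"], is_delta=True),
    RecordBatch(0, [2, 0]),   # BATCH 3
    DictionaryBatch(0, ["d"], is_delta=True),
    RecordBatch(0, [3, 2]),   # BATCH 4
]
```

Because a delta only appends, the indices written in batches 0-2 still point
at the same values once the later deltas are applied, so both readers return
the same decoded data. A non-delta redefinition partway through would not have
this property, which is the ordering concern Antoine raises.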