I replied on the original DISCUSS thread. I believe Antoine is unavailable this week, but hopefully we can drive the discussion to a consensus point next week so that we can vote.
On Sat, Oct 26, 2019 at 12:01 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> I think at least the wording was confusing because you raised questions on
> the PR and Antoine commented here.
>
> I agree with your analysis that it probably would not be hard to support.
> But don't feel too strongly either way on this particular point, aside from
> coming to a resolution. If I had to choose I'd prefer allowing Delta
> dictionaries in files.
>
> On Friday, October 25, 2019, Wes McKinney <wesmck...@gmail.com> wrote:
>>
>> Can we discuss the delta dictionary issue a bit more? I admit I don't
>> share the same concerns.
>>
>> From the perspective of a file and stream producer, the code paths
>> should be nearly identical. The differences with the file format are:
>>
>> * Magic numbers to detect that it is the "file format"
>> * Accumulated metadata at the footer
>>
>> If a file has any dictionaries at all, then they must all be
>> reconstructed before reading a record batch. So let's say we have a
>> file like
>>
>> DICTIONARY ID=0, isDelta=FALSE
>> BATCH 0
>> BATCH 1
>> BATCH 2
>> DICTIONARY ID=0, isDelta=TRUE
>> BATCH 3
>> DICTIONARY ID=0, isDelta=TRUE
>> BATCH 4
>>
>> I do not see any harm in this -- the only downside is that you won't
>> know what "state" the dictionary was in for the first 3 batches.
>> Viewing dictionary encoding strictly as a data representation method,
>> the batches 0-2 and 3 represent the same data even if their in-memory
>> dictionaries are larger than they were at the moment in which they
>> were written.
>>
>> Note that the code path for "processing" the dictionaries as a first
>> step will use the same code as the stream path. It should not be a
>> great deal of work to write test cases for this.
>>
>> On Thu, Oct 24, 2019 at 11:06 AM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>> >
>> > Hi Antoine,
>> > There is a defined order for dictionaries in metadata. What isn't well
>> > defined is relative ordering between record batches and Delta dictionaries.
>> >
>> > However, this point seems confusing. I can't think of a real-world use
>> > case where it would be valuable enough to include, so I will remove Delta
>> > dictionaries.
>> >
>> > So let's cancel this vote and I'll start a new one after the update.
>> >
>> > Thanks,
>> > Micah
>> >
>> > On Thursday, October 24, 2019, Antoine Pitrou <anto...@python.org> wrote:
>> >
>> > >
>> > > On 24/10/2019 at 04:39, Micah Kornfield wrote:
>> > > >
>> > > > 3. Clarifies that the file format can only contain 1 "NON-delta"
>> > > > dictionary batch and multiple "delta" dictionary batches.
>> > >
>> > > This is a bit weird. If the file format can carry delta dictionaries,
>> > > it means order is significant, so it may as well contain dictionary
>> > > redefinitions.
>> > >
>> > > If the file format is meant to be truly readable in random order, then
>> > > it should also forbid delta dictionaries.
>> > >
>> > > Regards
>> > >
>> > > Antoine.
>> >
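
For reference, here is a rough sketch of the file layout discussed in this thread (one non-delta dictionary, five batches, two delta dictionaries). It is only a toy model in plain Python, not the Arrow implementation: the names DictionaryMessage, BatchMessage, build_dictionaries and decode_batch are hypothetical, and the dictionary values "a" through "d" are made up for illustration. It assumes that all dictionary messages are replayed in order as a first step, after which any batch can be decoded in any order.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class DictionaryMessage:  # hypothetical stand-in for a dictionary batch
    dict_id: int
    values: List[str]
    is_delta: bool


@dataclass
class BatchMessage:  # hypothetical stand-in for a record batch of indices
    indices: List[int]
    dict_id: int = 0


Message = Union[DictionaryMessage, BatchMessage]


def build_dictionaries(messages: List[Message]) -> Dict[int, List[str]]:
    """Replay every dictionary message in file order, ignoring batches."""
    dictionaries: Dict[int, List[str]] = {}
    for msg in messages:
        if not isinstance(msg, DictionaryMessage):
            continue
        if msg.is_delta:
            # A delta appends new values to an already-defined dictionary.
            dictionaries[msg.dict_id].extend(msg.values)
        else:
            # A non-delta message defines the dictionary from scratch.
            dictionaries[msg.dict_id] = list(msg.values)
    return dictionaries


def decode_batch(batch: BatchMessage, dictionaries: Dict[int, List[str]]) -> List[str]:
    """Map a batch's indices back to values using the final dictionary."""
    values = dictionaries[batch.dict_id]
    return [values[i] for i in batch.indices]


# Same ordering as the example in the thread; the values are invented.
file_messages: List[Message] = [
    DictionaryMessage(0, ["a", "b"], is_delta=False),
    BatchMessage([0, 1, 0]),  # BATCH 0
    BatchMessage([1, 1]),     # BATCH 1
    BatchMessage([0]),        # BATCH 2
    DictionaryMessage(0, ["c"], is_delta=True),
    BatchMessage([2, 0]),     # BATCH 3
    DictionaryMessage(0, ["d"], is_delta=True),
    BatchMessage([3, 2]),     # BATCH 4
]

dictionaries = build_dictionaries(file_messages)
batches = [m for m in file_messages if isinstance(m, BatchMessage)]
decoded = [decode_batch(b, dictionaries) for b in batches]
```

In this model, batches 0-2 decode to the same values whether the dictionary holds two entries or four, because deltas only append and never change existing indices. A dictionary redefinition (a second non-delta message for the same ID) would break that property, which is where ordering would start to matter.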