I think the main sticking point was dictionaries in the file format. It seems like the use-case for delta dictionaries might be limited so I didn't feel strongly about it.
Antoine, did you have more thoughts on this? Thanks, Micah On Wed, Nov 6, 2019 at 9:24 AM Wes McKinney <wesmck...@gmail.com> wrote: > Just bumping this thread for more comments > > On Wed, Oct 30, 2019 at 3:11 PM Wes McKinney <wesmck...@gmail.com> wrote: > > > > Returning to this discussion as there seems to lack consensus in the > vote thread > > > > Copying Micah's proposals in the VOTE thread here, I wanted to state > > my opinions so we can discuss further and see where there is potential > > disagreement > > > > 1. It is not required that all dictionary batches occur at the beginning > > of the IPC stream format (if a the first record batch has an all null > > dictionary encoded column, the null column's dictionary might not be sent > > until later in the stream). > > > > This seems preferable to requiring a placeholder empty dictionary > > batch. This does mean more to test but the integration tests will > > force the issue > > > > 2. A second dictionary batch for the same ID that is not a "delta batch" > > in an IPC stream indicates the dictionary should be replaced. > > > > Agree. > > > > 3. Clarifies that the file format, can only contain 1 "NON-delta" > > dictionary batch and multiple "delta" dictionary batches. > > > > Agree -- it is also worth stating explicitly that dictionary > > replacements are not allowed in the file format. > > > > In the file format, all the dictionaries must be "loaded" up front. > > The code path for loading the dictionaries ideally should use nearly > > the same code as the stream-reader code that sees follow-up dictionary > > batches interspersed in the stream. The only downside is that it will > > not be possible to exactly preserve the dictionary "state" as of each > > record batch being written. > > > > So if we had a file containing > > > > DICTIONARY ID=0 > > RECORD BATCH > > RECORD BATCH > > DICTIONARY DELTA ID=0 > > RECORD BATCH > > RECORD BATCH > > > > Then after processing/loading the dictionaries, the first two record > > batches will have a dictionary that is "larger" (on account of the > > delta) than when they were written. Since dictionaries are > > fundamentally about data representation, they still represent the same > > data so I think this is acceptable. > > > > 4. Add an enum to dictionary metadata for possible future changes in > what > > format dictionary batches can be sent. (the most likely would be an array > > Map<Int, Value>). An enum is needed as a place holder to allow for > forward > > compatibility past the release 1.0.0. > > > > I'm least sure about this but I do not think it is harmful to have a > > forward-compatible "escape hatch" for future evolutions in dictionary > > encoding. > > > > On Wed, Oct 16, 2019 at 2:57 AM Micah Kornfield <emkornfi...@gmail.com> > wrote: > > > > > > I'll plan on starting a vote in the next day or two if there are no > further > > > objections/comments. > > > > > > On Sun, Oct 13, 2019 at 11:06 AM Micah Kornfield < > emkornfi...@gmail.com> > > > wrote: > > > > > > > I think the only point asked on the PR that I think is worth > discussing is > > > > assumptions about dictionaries at the beginning of streams. > > > > > > > > There are two options: > > > > 1. Based on the current wording, it does not seem that all > dictionaries > > > > need to be at the beginning of the stream if they aren't made use of > in the > > > > first record batch (i.e. a dictionary encoded column is all null in > the > > > > first record batch). > > > > 2. We require a dictionary batch for each dictionary at the > beginning of > > > > the stream (and require implementations to send an empty batch if > they > > > > don't have the dictionary available). > > > > > > > > The current proposal in the PR is option #1. > > > > > > > > Thanks, > > > > Micah > > > > > > > > On Sat, Oct 5, 2019 at 4:01 PM Micah Kornfield < > emkornfi...@gmail.com> > > > > wrote: > > > > > > > >> I've opened a pull request [1] to clarify some recent conversations > about > > > >> semantics/edge cases for dictionary encoding [2][3] around > interleaved > > > >> batches and when isDelta=False. > > > >> > > > >> Specifically, it proposes isDelta=False indicates dictionary > > > >> replacement. For the file format, only one isDelta=False batch is > allowed > > > >> per file and isDelta=true batches are applied in the order supplied > file > > > >> footer. > > > >> > > > >> In addition, I've added a new enum to DictionaryEncoding to preserve > > > >> future compatibility in case we want to expand dictionary encoding > to be an > > > >> explicit mapping from "ID" to "VALUE" as discussed in [4]. > > > >> > > > >> Once people have had a change to review and come to a consensus. I > will > > > >> call a formal vote to approve the change commit the change. > > > >> > > > >> Thanks, > > > >> Micah > > > >> > > > >> [1] https://github.com/apache/arrow/pull/5585 > > > >> [2] > > > >> > https://lists.apache.org/thread.html/9734b71bc12aca16eb997388e95105bff412fdaefa4e19422f477389@%3Cdev.arrow.apache.org%3E > > > >> [3] > > > >> > https://lists.apache.org/thread.html/5c3c9346101df8d758e24664638e8ada0211d310ab756a89cde3786a@%3Cdev.arrow.apache.org%3E > > > >> [4] > > > >> > https://lists.apache.org/thread.html/15a4810589b2eb772bce5b2372970d9d93badbd28999a1bbe2af418a@%3Cdev.arrow.apache.org%3E > > > >> > > > >> >