I think the only point raised on the PR that is worth discussing is the assumptions about dictionaries at the beginning of streams.

There are two options:

1. Based on the current wording, it does not seem that a dictionary needs to be at the beginning of the stream if it isn't used in the first record batch (i.e. a dictionary-encoded column is all null in the first record batch).
2. We require a dictionary batch for each dictionary at the beginning of the stream (and require implementations to send an empty batch if they don't have the dictionary available).

The current proposal in the PR is option #1.

Thanks,
Micah

On Sat, Oct 5, 2019 at 4:01 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

> I've opened a pull request [1] to clarify some recent conversations about
> semantics/edge cases for dictionary encoding [2][3] around interleaved
> batches and when isDelta=False.
>
> Specifically, it proposes that isDelta=False indicates dictionary replacement.
> For the file format, only one isDelta=False batch is allowed per file and
> isDelta=true batches are applied in the order supplied in the file footer.
>
> In addition, I've added a new enum to DictionaryEncoding to preserve
> future compatibility in case we want to expand dictionary encoding to be an
> explicit mapping from "ID" to "VALUE" as discussed in [4].
>
> Once people have had a chance to review and come to a consensus, I will
> call a formal vote to approve and commit the change.
>
> Thanks,
> Micah
>
> [1] https://github.com/apache/arrow/pull/5585
> [2] https://lists.apache.org/thread.html/9734b71bc12aca16eb997388e95105bff412fdaefa4e19422f477389@%3Cdev.arrow.apache.org%3E
> [3] https://lists.apache.org/thread.html/5c3c9346101df8d758e24664638e8ada0211d310ab756a89cde3786a@%3Cdev.arrow.apache.org%3E
> [4] https://lists.apache.org/thread.html/15a4810589b2eb772bce5b2372970d9d93badbd28999a1bbe2af418a@%3Cdev.arrow.apache.org%3E
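
To make the semantics in this thread concrete, here is a minimal Python sketch of how a stream reader might track dictionaries under option #1, with isDelta=False treated as replacement and isDelta=true as an append. The names DictionaryBatch, DictionaryState, apply and decode are hypothetical and are not part of any Arrow library. Raising an error for a delta that targets an unknown dictionary is an assumption of this sketch, not something the proposal states.

```python
from dataclasses import dataclass, field


@dataclass
class DictionaryBatch:
    """Hypothetical stand-in for a dictionary batch message."""
    dict_id: int
    values: list
    is_delta: bool = False


@dataclass
class DictionaryState:
    """Tracks the current values of each dictionary seen in a stream."""
    dictionaries: dict = field(default_factory=dict)

    def apply(self, batch: DictionaryBatch) -> None:
        if batch.is_delta:
            # isDelta=true: append the new values to the existing dictionary.
            # Assumption: a delta for an unknown dictionary is an error.
            if batch.dict_id not in self.dictionaries:
                raise ValueError(f"delta for unknown dictionary {batch.dict_id}")
            self.dictionaries[batch.dict_id].extend(batch.values)
        else:
            # isDelta=False: replace the dictionary wholesale.
            self.dictionaries[batch.dict_id] = list(batch.values)

    def decode(self, dict_id: int, indices: list) -> list:
        # None marks a null slot. Under option #1, an all-null column can be
        # decoded before its dictionary has been sent.
        if all(i is None for i in indices):
            return [None] * len(indices)
        if dict_id not in self.dictionaries:
            raise KeyError(f"dictionary {dict_id} has not been sent yet")
        values = self.dictionaries[dict_id]
        return [None if i is None else values[i] for i in indices]


state = DictionaryState()
first = state.decode(0, [None, None])       # all null, no dictionary needed yet
state.apply(DictionaryBatch(0, ["a", "b"]))
state.apply(DictionaryBatch(0, ["c"], is_delta=True))
after_delta = state.decode(0, [0, 2, None])
state.apply(DictionaryBatch(0, ["x"]))      # replacement discards a, b, c
after_replace = state.decode(0, [0])
```

Under option #2, the first decode call would instead be preceded by a (possibly empty) dictionary batch for dictionary 0, so the reader would never need the all-null special case.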