Yes, I opened a JIRA. I'm going to try to make a proposal that consolidates all the recent dictionary discussions.
On Mon, Sep 9, 2019 at 12:21 PM Wes McKinney <wesmck...@gmail.com> wrote:

> hi Micah,
>
> I think we should formulate changes to format/Columnar.rst and have a
> vote, what do you think?
>
> On Thu, Aug 29, 2019 at 2:23 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
> >>
> >> > I was thinking the file format must satisfy one of two conditions:
> >> > 1. Exactly one DictionaryBatch per encoded column
> >> > 2. DictionaryBatches are interleaved correctly.
> >>
> >> Could you clarify?
> >
> > I think you clarified it very well :) My motivation for suggesting the
> > additional complexity is I see two use-cases for the file format. These
> > roughly correspond with the two options I suggested:
> >
> > 1. We are encoding data from scratch. In this case, it seems like all
> > dictionaries would be built incrementally, not need replacement and we
> > write them at the end of the file [1]
> >
> > 2. The data being written out is essentially a "tee" off of some stream
> > that is generating new dictionaries requiring replacement on the fly
> > (e.g. reading back two parquet files).
> >
> >> It might be better to disallow replacements
> >> in the file format (which does introduce semantic slippage between the
> >> file and stream formats as Antoine was saying).
> >
> > It is certainly possible to accept the slippage from the stream
> > format for now and later add this capability, since it should be
> > forwards compatible.
> >
> > Thanks,
> > Micah
> >
> > [1] There is also a medium-complexity option where we require one
> > non-delta dictionary and as many delta dictionaries as the user wants.
> >
> > On Wed, Aug 28, 2019 at 7:50 AM Wes McKinney <wesmck...@gmail.com>
> > wrote:
> >>
> >> On Tue, Aug 27, 2019 at 6:05 PM Micah Kornfield <emkornfi...@gmail.com>
> >> wrote:
> >> >
> >> > I was thinking the file format must satisfy one of two conditions:
> >> > 1. Exactly one DictionaryBatch per encoded column
> >> > 2. DictionaryBatches are interleaved correctly.
> >>
> >> Could you clarify? In the first case, there is no issue with
> >> dictionary replacements. I'm not sure about the second case -- if a
> >> dictionary id appears twice, then you'll see it twice in the file
> >> footer. I suppose you could look at the file offsets to determine
> >> whether a dictionary batch precedes a particular record batch block
> >> (to know which dictionary you should be using), but that's rather
> >> complicated to implement. It might be better to disallow replacements
> >> in the file format (which does introduce semantic slippage between the
> >> file and stream formats as Antoine was saying).
> >>
> >> >
> >> > On Tuesday, August 27, 2019, Wes McKinney <wesmck...@gmail.com>
> >> > wrote:
> >> >
> >> > > On Tue, Aug 27, 2019 at 3:55 PM Antoine Pitrou <anto...@python.org>
> >> > > wrote:
> >> > > >
> >> > > > On 27/08/2019 at 22:31, Wes McKinney wrote:
> >> > > > > So the current situation we have right now in C++ is that if we
> >> > > > > tried to create an IPC stream from a sequence of record batches
> >> > > > > that don't all have the same dictionary, we'd run into two
> >> > > > > scenarios:
> >> > > > >
> >> > > > > * Batches that either have a prefix of a prior-observed
> >> > > > > dictionary, or the prior dictionary is a prefix of their
> >> > > > > dictionary. For example, suppose that the dictionary sent for an
> >> > > > > id was ['A', 'B', 'C'] and then there's a subsequent batch with
> >> > > > > ['A', 'B', 'C', 'D', 'E']. In such a case we could compute and
> >> > > > > send a delta batch.
> >> > > > >
> >> > > > > * Batches with a dictionary that is a permutation of values, and
> >> > > > > possibly new unique values.
> >> > > > >
> >> > > > > In this latter case, without the option of replacing an existing
> >> > > > > ID in the stream, we would have to do a unification / permutation
> >> > > > > of indices and then also possibly send a delta batch. We should
> >> > > > > probably have code at some point that deals with both cases, but
> >> > > > > in the meantime I would like to allow dictionaries to be
> >> > > > > redefined in this case. Seems like we might need a vote to
> >> > > > > formalize this?
> >> > > >
> >> > > > Isn't the stream format deviating from the file format then? In
> >> > > > the file format, IIUC, dictionaries can appear after the
> >> > > > respective record batches, so there's no way to tell whether the
> >> > > > original or redefined version of a dictionary is being referred
> >> > > > to.
> >> > >
> >> > > You make a good point -- we can consider changes to the file format
> >> > > to allow for record batches to have different dictionaries. Even
> >> > > handling delta dictionaries with the current file format would be a
> >> > > bit tedious (though not indeterminate)
> >> > >
> >> > > > Regards
> >> > > >
> >> > > > Antoine.
> >> > > >
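
For reference, here is a rough Python sketch of the two stream-side cases
described in the quoted thread: a dictionary that is a prefix extension of
the prior one (a delta can be sent), and a dictionary that is a permutation
with possibly new values (either replace it, or unify and remap indices).
This is only an illustration on plain Python lists, not Arrow code. The
function names classify_dictionary and unify_dictionary are hypothetical,
and it assumes dictionary values are hashable and indices contain no nulls.

```python
def classify_dictionary(prior, new):
    """Decide what a writer would need to send for `new`, given `prior`.

    Returns a (kind, values) pair where kind is one of:
      "reuse"   - new is equal to, or a prefix of, the prior dictionary
      "delta"   - prior is a prefix of new; values holds the appended tail
      "replace" - anything else (e.g. a permutation); values is the full new
                  dictionary
    """
    if prior[:len(new)] == new:
        return ("reuse", [])
    if new[:len(prior)] == prior:
        return ("delta", new[len(prior):])
    return ("replace", list(new))


def unify_dictionary(prior, new, indices):
    """Alternative to replacement: keep `prior`, append unseen values as a
    delta, and remap `indices` (which point into `new`) onto the result."""
    position = {value: i for i, value in enumerate(prior)}
    delta = []
    for value in new:
        if value not in position:
            position[value] = len(prior) + len(delta)
            delta.append(value)
    remapped = [position[new[i]] for i in indices]
    return delta, remapped


if __name__ == "__main__":
    prior = ["A", "B", "C"]
    print(classify_dictionary(prior, ["A", "B", "C", "D", "E"]))
    print(classify_dictionary(prior, ["C", "A", "D"]))
    print(unify_dictionary(prior, ["C", "A", "D"], [0, 1, 2, 0]))
```

With the example from the thread, ['A', 'B', 'C'] followed by
['A', 'B', 'C', 'D', 'E'] is classified as a delta of ['D', 'E']. The
permuted case falls through to "replace" unless the writer chooses to unify
and remap instead.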