I was thinking the file format must satisfy one of two conditions:

1. Exactly one DictionaryBatch per encoded column.
2. DictionaryBatches are interleaved correctly.

On Tuesday, August 27, 2019, Wes McKinney <wesmck...@gmail.com> wrote:

> On Tue, Aug 27, 2019 at 3:55 PM Antoine Pitrou <anto...@python.org> wrote:
> >
> >
> > Le 27/08/2019 à 22:31, Wes McKinney a écrit :
> > > So the current situation we have right now in C++ is that if we tried
> > > to create an IPC stream from a sequence of record batches that don't
> > > all have the same dictionary, we'd run into two scenarios:
> > >
> > > * Batches that either have a prefix of a prior-observed dictionary, or
> > > the prior dictionary is a prefix of their dictionary. For example,
> > > suppose that the dictionary sent for an id was ['A', 'B', 'C'] and
> > > then there's a subsequent batch with ['A', 'B', 'C', 'D', 'E']. In
> > > such case we could compute and send a delta batch
> > >
> > > * Batches with a dictionary that is a permutation of values, and
> > > possibly new unique values.
> > >
> > > In this latter case, without the option of replacing an existing ID in
> > > the stream, we would have to do a unification / permutation of indices
> > > and then also possibly send a delta batch. We should probably have
> > > code at some point that deals with both cases, but in the meantime I
> > > would like to allow dictionaries to be redefined in this case. Seems
> > > like we might need a vote to formalize this?
> >
> > Isn't the stream format deviating from the file format then? In the
> > file format, IIUC, dictionaries can appear after the respective record
> > batches, so there's no way to tell whether the original or redefined
> > version of a dictionary is being referred to.
>
> You make a good point -- we can consider changes to the file format to
> allow for record batches to have different dictionaries. Even handling
> delta dictionaries with the current file format would be a bit tedious
> (though not indeterminate)
>
> > Regards
> >
> > Antoine.
>
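
For reference, the two scenarios Wes describes in the quoted text could be told apart with something like the sketch below. It is plain Python, not the Arrow C++ code: `plan_dictionary_update` and the "reuse" / "delta" / "replace" labels are names made up for illustration, dictionaries are modelled as plain lists of strings, and the "replace" outcome assumes the stream format permits redefining a dictionary id, which is the change under discussion rather than settled behaviour.

```python
from typing import List, Sequence, Tuple


def plan_dictionary_update(prior: Sequence[str],
                           new: Sequence[str]) -> Tuple[str, List[str]]:
    """Decide what a writer would have to send for `new`, given that
    `prior` was already sent for the same dictionary id.

    Hypothetical helper, for illustration only.
    """
    prior, new = list(prior), list(new)

    # The new dictionary is a prefix of (or equal to) the prior one:
    # existing indices stay valid, so nothing needs to be sent.
    if new == prior[:len(new)]:
        return ("reuse", [])

    # The prior dictionary is a prefix of the new one:
    # only the appended values need to go out, as a delta batch.
    if prior == new[:len(prior)]:
        return ("delta", new[len(prior):])

    # Permutation and/or unrelated values: either redefine the dictionary
    # (assumed to be allowed here) or unify and remap indices instead.
    return ("replace", new)


if __name__ == "__main__":
    print(plan_dictionary_update(["A", "B", "C"], ["A", "B", "C", "D", "E"]))
    print(plan_dictionary_update(["A", "B", "C"], ["C", "A", "B", "D"]))
```

The first call corresponds to the ['A', 'B', 'C'] then ['A', 'B', 'C', 'D', 'E'] example from the thread; the second is the permutation-plus-new-values case, where a writer without replacement support would have to remap indices before sending anything.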