Wow, you've shown how little I've thought about Arrow dictionaries lately. I thought we had a dictionary id and a record-in-dictionary id. Wouldn't that approach make more sense? Does no one do this today? (We frequently use compound values for this type of scenario...)
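To make the compound-value idea concrete, here is a minimal sketch in plain Python of what referencing entries by (dictionary id, record-in-dictionary id) could look like. The names (DictRef, dictionary_id, record_id) are hypothetical illustrations, not anything defined in the Arrow spec or libraries:

    from typing import Dict, List, NamedTuple

    class DictRef(NamedTuple):
        """Hypothetical compound index: which dictionary, and which entry in it."""
        dictionary_id: int
        record_id: int  # position within that dictionary

    # Two dictionaries for the same logical column, e.g. one per source file.
    dictionaries: Dict[int, List[str]] = {
        1: ["apple", "banana"],
        2: ["banana", "cherry"],
    }

    # Encoded values carry (dictionary_id, record_id) pairs, so batches encoded
    # against different dictionaries can coexist in one stream without re-encoding.
    column = [DictRef(1, 0), DictRef(1, 1), DictRef(2, 1)]

    decoded = [dictionaries[ref.dictionary_id][ref.record_id] for ref in column]
    print(decoded)  # ['apple', 'banana', 'cherry']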
On Sat, Aug 10, 2019 at 4:20 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

> Reading data from two different parquet files sequentially with different
> dictionaries for the same column. This could be handled by re-encoding the
> data, but that seems potentially sub-optimal.
>
> On Sat, Aug 10, 2019 at 12:38 PM Jacques Nadeau <jacq...@apache.org> wrote:
>
>> What situation are you anticipating where you're going to be restating ids
>> mid stream?
>>
>> On Sat, Aug 10, 2019 at 12:13 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>
>>> The IPC specification [1] defines behavior when isDelta on a
>>> DictionaryBatch [2] is "true". I might have missed it in the
>>> specification, but I couldn't find the interpretation of the expected
>>> behavior when isDelta=false and two dictionary batches with the same ID
>>> are sent.
>>>
>>> It seems like there are two options:
>>> 1. Interpret the new dictionary batch as replacing the old one.
>>> 2. Regard this as an error condition.
>>>
>>> Based on the fact that in the "file format" dictionaries are allowed to
>>> be placed in any order relative to the record batches, I assume it is the
>>> second, but just wanted to make sure.
>>>
>>> Thanks,
>>> Micah
>>>
>>> [1] https://arrow.apache.org/docs/ipc.html
>>> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L72
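For reference, a minimal sketch (plain Python, hypothetical names, not the actual Arrow reader code in any implementation) of how a stream reader could track dictionaries under the two interpretations Micah lists, with delta batches always appending per the IPC spec:

    # REPLACE_ON_DUPLICATE selects between option 1 (replacement) and option 2 (error)
    # for a second non-delta DictionaryBatch with the same id.
    REPLACE_ON_DUPLICATE = True

    dictionaries = {}  # dictionary id -> list of values accumulated so far

    def on_dictionary_batch(dict_id, values, is_delta):
        if is_delta:
            # isDelta=true: append the new entries to the existing dictionary.
            dictionaries.setdefault(dict_id, []).extend(values)
        elif dict_id in dictionaries and not REPLACE_ON_DUPLICATE:
            # Option 2: a repeated non-delta batch for the same id is an error.
            raise ValueError("duplicate dictionary batch for id %d" % dict_id)
        else:
            # Option 1 (or first occurrence): the batch (re)defines the dictionary.
            dictionaries[dict_id] = list(values)

Under option 1, record batches would resolve their indices against whatever version of the dictionary is current when they arrive, which is exactly what makes the ordering allowance in the file format relevant to the question.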