Wow, you've shown how little I've thought about Arrow dictionaries lately. I thought we had a dictionary id and a record-in-dictionary id. Wouldn't that approach make more sense? Does no one do this today? (We frequently use compound values for this type of scenario...)
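To make the compound-value idea concrete, here is a minimal sketch in plain Python of what referencing entries by (dictionary id, record-in-dictionary id) could look like. The names (DictRef, dictionary_id, record_id) are hypothetical illustrations, not anything defined in the Arrow spec or libraries:

    from typing import Dict, List, NamedTuple

    class DictRef(NamedTuple):
        """Hypothetical compound index: which dictionary, and which entry in it."""
        dictionary_id: int
        record_id: int  # position within that dictionary

    # Two dictionaries for the same logical column, e.g. one per source file.
    dictionaries: Dict[int, List[str]] = {
        1: ["apple", "banana"],
        2: ["banana", "cherry"],
    }

    # Encoded values carry (dictionary_id, record_id) pairs, so batches encoded
    # against different dictionaries can coexist in one stream without re-encoding.
    column = [DictRef(1, 0), DictRef(1, 1), DictRef(2, 1)]

    decoded = [dictionaries[ref.dictionary_id][ref.record_id] for ref in column]
    print(decoded)  # ['apple', 'banana', 'cherry']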
On Sat, Aug 10, 2019 at 4:20 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

> Reading data from two different parquet files sequentially with different
> dictionaries for the same column. This could be handled by re-encoding the
> data, but that seems potentially sub-optimal.
>
> On Sat, Aug 10, 2019 at 12:38 PM Jacques Nadeau <jacq...@apache.org> wrote:
>
>> What situation are you anticipating where you're going to be restating ids
>> mid stream?
>>
>> On Sat, Aug 10, 2019 at 12:13 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>
>>> The IPC specification [1] defines behavior when isDelta on a
>>> DictionaryBatch [2] is "true". I might have missed it in the
>>> specification, but I couldn't find the interpretation of the expected
>>> behavior when isDelta=false and two dictionary batches with the same ID
>>> are sent.
>>>
>>> It seems like there are two options:
>>> 1. Interpret the new dictionary batch as replacing the old one.
>>> 2. Regard this as an error condition.
>>>
>>> Based on the fact that in the "file format" dictionaries are allowed to
>>> be placed in any order relative to the record batches, I assume it is the
>>> second, but just wanted to make sure.
>>>
>>> Thanks,
>>> Micah
>>>
>>> [1] https://arrow.apache.org/docs/ipc.html
>>> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L72
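For reference, a minimal sketch (plain Python, hypothetical names, not the actual Arrow reader code in any implementation) of how a stream reader could track dictionaries under the two interpretations Micah lists, with delta batches always appending per the IPC spec:

    # REPLACE_ON_DUPLICATE selects between option 1 (replacement) and option 2 (error)
    # for a second non-delta DictionaryBatch with the same id.
    REPLACE_ON_DUPLICATE = True

    dictionaries = {}  # dictionary id -> list of values accumulated so far

    def on_dictionary_batch(dict_id, values, is_delta):
        if is_delta:
            # isDelta=true: append the new entries to the existing dictionary.
            dictionaries.setdefault(dict_id, []).extend(values)
        elif dict_id in dictionaries and not REPLACE_ON_DUPLICATE:
            # Option 2: a repeated non-delta batch for the same id is an error.
            raise ValueError("duplicate dictionary batch for id %d" % dict_id)
        else:
            # Option 1 (or first occurrence): the batch (re)defines the dictionary.
            dictionaries[dict_id] = list(values)

Under option 1, record batches would resolve their indices against whatever version of the dictionary is current when they arrive, which is exactly what makes the ordering allowance in the file format relevant to the question.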