Opened https://issues.apache.org/jira/browse/ARROW-1727

On Tue, Oct 24, 2017 at 6:16 PM, Wes McKinney <[email protected]> wrote:
> hi Brian,
>
> Thanks for bringing this up. I'm +1 on having a mechanism to enable
> dictionaries to grow or change mid-stream. I figured that this would
> eventually come up and the current design for the stream does not
> preclude having dictionaries show up mid-stream. As an example, a
> service streaming data from Parquet files might send
> dictionary-encoded versions of some columns, and it would not be
> practical to have to scan all of the Parquet files of interest to find
> the global dictionary. The Apache CarbonData format built some
> Spark-based infrastructure around this exact problem, but we cannot
> assume that it will be cheap or practical to find the global
> dictionary up front.
>
> I think having dictionary messages occur after the first record
> batches is a reasonable strategy. I would suggest we add a "type"
> field to the DictionaryBatch message type ([1]) so that we can either
> indicate that the message is a NEW dictionary (i.e. the existing one
> should be dropped) or a DELTA (additions) to an existing dictionary. I
> don't think it will be difficult to accommodate this in the C++
> implementation, for example (though we will need to finally implement
> "concatenate" for all supported types to make it work).
>
> Thanks,
> Wes
>
> [1]: https://github.com/apache/arrow/blob/master/format/Message.fbs#L86
>
> On Tue, Oct 24, 2017 at 3:44 PM, Brian Hulette <[email protected]> wrote:
>> One issue we've struggled with when adding an Arrow interface to GeoMesa is
>> the requirement to send all dictionary batches before record batches in the
>> IPC formats. Sometimes we have pre-computed "top-k" stats that we can use to
>> assemble a dictionary beforehand, but those don't always exist, and even
>> when they do, they are by definition incomplete, so we could end up hiding
>> valuable data in an "Other" category. In practice, we often have to wait
>> to collect all the data before we can start streaming anything.
>>
>> I'd like to propose a couple of modifications to the Arrow IPC formats that
>> could help alleviate this problem:
>> 1) Allow multiple dictionary batches to use the same id. The vectors in all
>> dictionary batches with the same id can be concatenated together to
>> represent the full dictionary with that id.
>> 2) Allow dictionary batches and record batches to be interleaved. For the
>> streaming format, there could be an additional requirement that any
>> dictionary key used in a record batch must have been defined in a previously
>> sent dictionary batch.
>>
>> These changes would allow producers to send "delta" dictionary batches in an
>> Arrow stream to define new keys that will be used in future record batches.
>> Here's an example stream with one column of city names, to help illustrate
>> the idea:
>>
>> <SCHEMA>
>> <DICTIONARY id=0>
>> (0) "New York"
>> (1) "Seattle"
>> (2) "Washington, DC"
>>
>> <RECORD BATCH 0>
>> 0
>> 1
>> 2
>> 1
>>
>> <DICTIONARY id=0>
>> (3) "Chicago"
>> (4) "San Francisco"
>>
>> <RECORD BATCH 1>
>> 3
>> 2
>> 4
>> 0
>> EOS
>>
>>
>> Decoded Data:
>> -------------
>> New York
>> Seattle
>> Washington, DC
>> Seattle
>> Chicago
>> Washington, DC
>> San Francisco
>> New York
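>>
>> To make the bookkeeping concrete, here is a minimal sketch (with
>> hypothetical helper names, not the actual JS reader API) of a consumer
>> accumulating the dictionary batches above and decoding the two record
>> batches:
>>
>> // One growing dictionary per id; deltas are concatenated as they arrive.
>> const dictionaries = new Map<number, string[]>();
>>
>> function onDictionaryBatch(id: number, values: string[]): void {
>>   const existing = dictionaries.get(id) ?? [];
>>   dictionaries.set(id, existing.concat(values));
>> }
>>
>> function decodeColumn(dictionaryId: number, indices: number[]): string[] {
>>   const dictionary = dictionaries.get(dictionaryId)!;
>>   return indices.map(i => dictionary[i]);
>> }
>>
>> onDictionaryBatch(0, ["New York", "Seattle", "Washington, DC"]);
>> decodeColumn(0, [0, 1, 2, 1]);  // record batch 0
>> onDictionaryBatch(0, ["Chicago", "San Francisco"]);
>> decodeColumn(0, [3, 2, 4, 0]);  // record batch 1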
>>
>>
>> I also think it would be valuable if the requirement mentioned in #2 applied
>> only to the streaming format, so that the random-access format could support
>> dictionary batches that follow record batches. That way, producers creating
>> random-access files could start writing record batches before all the data
>> for the dictionaries has been assembled.
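>>
>> As a sketch of why this works for the random-access format (simplified
>> types, not the real File.fbs tables): the footer indexes dictionary blocks
>> and record batch blocks separately, so a reader can load every dictionary
>> before decoding any record batch, regardless of where the writer placed
>> them in the file:
>>
>> interface Block { offset: number; metaDataLength: number; bodyLength: number; }
>> interface Footer { dictionaries: Block[]; recordBatches: Block[]; }
>>
>> function readFile(footer: Footer, loadBlock: (block: Block) => void): void {
>>   // Dictionary batches may physically follow record batches in the file,
>>   // but the footer lets us read them first.
>>   for (const block of footer.dictionaries) loadBlock(block);
>>   for (const block of footer.recordBatches) loadBlock(block);
>> }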
>>
>> I need to give Paul Taylor credit for this idea - he already wrote the JS
>> Arrow reader to combine dictionaries with the same id
>> (https://github.com/apache/arrow/blob/master/js/src/reader/arrow.ts#L59),
>> and it occurred to me that the same approach could be a solution for us.
>>
>> Thanks
>> Brian
>>
