Opened https://issues.apache.org/jira/browse/ARROW-1727

On Tue, Oct 24, 2017 at 6:16 PM, Wes McKinney <[email protected]> wrote:
> hi Brian,
>
> Thanks for bringing this up. I'm +1 on having a mechanism to enable
> dictionaries to grow or change mid-stream. I figured that this would
> eventually come up and the current design for the stream does not
> preclude having dictionaries show up mid-stream. As an example, a
> service streaming data from Parquet files might send
> dictionary-encoded versions of some columns, and it would not be
> practical to have to scan all of the Parquet files of interest to find
> the global dictionary. The Apache CarbonData format built some
> Spark-based infrastructure around this exact problem, but we cannot
> assume that it will be cheap or practical to find the global
> dictionary up front.
>
> I think having dictionary messages occur after the first record
> batches is a reasonable strategy. I would suggest we add a "type"
> field to the DictionaryBatch message type ([1]) so that we can either
> indicate that the message is a NEW dictionary (i.e. the existing one
> should be dropped) or a DELTA (additions) to an existing dictionary. I
> don't think it will be difficult to accommodate this in the C++
> implementation, for example (though we will need to finally implement
> "concatenate" for all supported types to make it work).
>
> Thanks,
> Wes
>
> [1]: https://github.com/apache/arrow/blob/master/format/Message.fbs#L86
>
> On Tue, Oct 24, 2017 at 3:44 PM, Brian Hulette <[email protected]> wrote:
>> One issue we've struggled with when adding an Arrow interface to Geomesa
>> is the requirement to send all dictionary batches before record batches
>> in the IPC formats. Sometimes we have pre-computed "top-k" stats that we
>> can use to assemble a dictionary beforehand, but those don't always
>> exist, and even when they do they aren't complete by definition, so we
>> could end up hiding valuable data in an "Other" category. So in practice
>> we often have to wait to collect all the data before we can start
>> streaming anything.
>>
>> I'd like to propose a couple of modifications to the Arrow IPC formats
>> that could help alleviate this problem:
>> 1) Allow multiple dictionary batches to use the same id. The vectors in
>> all dictionary batches with the same id can be concatenated together to
>> represent the full dictionary with that id.
>> 2) Allow dictionary batches and record batches to be interleaved. For
>> the streaming format, there could be an additional requirement that any
>> dictionary key used in a record batch must have been defined in a
>> previously sent dictionary batch.
>>
>> These changes would allow producers to send "delta" dictionary batches
>> in an Arrow stream to define new keys that will be used in future record
>> batches. Here's an example stream with one column of city names, to help
>> illustrate the idea:
>>
>> <SCHEMA>
>> <DICTIONARY id=0>
>>   (0) "New York"
>>   (1) "Seattle"
>>   (2) "Washington, DC"
>>
>> <RECORD BATCH 0>
>>   0
>>   1
>>   2
>>   1
>>
>> <DICTIONARY id=0>
>>   (3) "Chicago"
>>   (4) "San Francisco"
>>
>> <RECORD BATCH 1>
>>   3
>>   2
>>   4
>>   0
>> EOS
>>
>> Decoded Data:
>> -------------
>>   New York
>>   Seattle
>>   Washington, DC
>>   Seattle
>>   Chicago
>>   Washington, DC
>>   San Francisco
>>   New York
>>
>> I also think it would be valuable if the requirement mentioned in #2
>> applied only to the streaming format, so that the random-access format
>> would support dictionary batches following record batches. That way
>> producers creating random-access files could start writing record
>> batches before all the data for the dictionaries has been assembled.
>>
>> I need to give Paul Taylor credit for this idea - he actually already
>> wrote the JS arrow reader to combine dictionaries with the same id
>> (https://github.com/apache/arrow/blob/master/js/src/reader/arrow.ts#L59),
>> and it occurred to me that that could be a solution for us.
>>
>> Thanks
>> Brian
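
A minimal TypeScript sketch of the consumer-side behavior Brian's example implies: accumulate dictionary values by id as delta batches arrive, and resolve record-batch indices against the combined dictionary. The types, class name, and error handling here are illustrative only and are not the Arrow JS reader's actual API.

```typescript
// Hypothetical message shapes, just for this sketch (not Arrow IPC structures).
type DictionaryBatch = { id: number; values: string[] };
type RecordBatch = { dictionaryId: number; indices: number[] };

class DeltaDictionaryDecoder {
  // Combined dictionary values seen so far, keyed by dictionary id.
  private dictionaries = new Map<number, string[]>();

  // Proposal #1: batches sharing an id are concatenated into one dictionary.
  addDictionaryBatch(batch: DictionaryBatch): void {
    const existing = this.dictionaries.get(batch.id) ?? [];
    this.dictionaries.set(batch.id, existing.concat(batch.values));
  }

  // Proposal #2 (streaming form): every key used in a record batch must have
  // been defined by a previously received dictionary batch.
  decodeRecordBatch(batch: RecordBatch): string[] {
    const dictionary = this.dictionaries.get(batch.dictionaryId);
    if (!dictionary) {
      throw new Error(`No dictionary received yet for id ${batch.dictionaryId}`);
    }
    return batch.indices.map((i) => {
      if (i >= dictionary.length) {
        throw new Error(`Index ${i} not defined by any prior dictionary batch`);
      }
      return dictionary[i];
    });
  }
}

// Replaying the city-name stream from the email:
const decoder = new DeltaDictionaryDecoder();
decoder.addDictionaryBatch({ id: 0, values: ["New York", "Seattle", "Washington, DC"] });
console.log(decoder.decodeRecordBatch({ dictionaryId: 0, indices: [0, 1, 2, 1] }));
// -> ["New York", "Seattle", "Washington, DC", "Seattle"]
decoder.addDictionaryBatch({ id: 0, values: ["Chicago", "San Francisco"] });
console.log(decoder.decodeRecordBatch({ dictionaryId: 0, indices: [3, 2, 4, 0] }));
// -> ["Chicago", "Washington, DC", "San Francisco", "New York"]
```

Relaxing the "previously defined" check is what would distinguish the random-access file case Brian mentions, where dictionary batches may legitimately follow the record batches that reference them.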
