What I'd proposed was to add metadata to each DictionaryBatch message to indicate either an append (DELTA) or a replacement (NEW).
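To make that concrete, here's a minimal sketch (TypeScript, loosely in the spirit of the JS reader Brian linked below) of how a consumer might apply the two kinds of dictionary messages. The DictionaryBatchType enum and the surrounding names are hypothetical stand-ins, not the actual Flatbuffers change, which is still to be worked out in ARROW-1727:

// Hypothetical sketch only: the real Flatbuffers field and reader API
// are still to be defined (see ARROW-1727). Illustrates NEW vs. DELTA.

enum DictionaryBatchType {
  NEW,   // replace any existing dictionary with this id
  DELTA, // append these values to the existing dictionary
}

interface DictionaryBatch {
  id: number;
  type: DictionaryBatchType;
  values: string[]; // stand-in for a decoded Arrow vector
}

// Dictionaries seen so far in the stream, keyed by dictionary id.
const dictionaries = new Map<number, string[]>();

function applyDictionaryBatch(batch: DictionaryBatch): void {
  const existing = dictionaries.get(batch.id);
  if (batch.type === DictionaryBatchType.NEW || existing === undefined) {
    // NEW (or first appearance of this id): drop anything previous.
    dictionaries.set(batch.id, [...batch.values]);
  } else {
    // DELTA: concatenate, so previously assigned keys keep their
    // meaning and the new values extend the key range.
    dictionaries.set(batch.id, existing.concat(batch.values));
  }
}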
On Wed, Oct 25, 2017 at 9:23 PM, Jacques Nadeau <[email protected]> wrote:
> Is the proposal to only append to the dictionary or to redefine it?
>
> On Wed, Oct 25, 2017 at 7:16 AM, Wes McKinney <[email protected]> wrote:
>
>> Opened https://issues.apache.org/jira/browse/ARROW-1727
>>
>> On Tue, Oct 24, 2017 at 6:16 PM, Wes McKinney <[email protected]> wrote:
>> > hi Brian,
>> >
>> > Thanks for bringing this up. I'm +1 on having a mechanism to enable dictionaries to grow or change mid-stream. I figured that this would eventually come up, and the current design for the stream does not preclude having dictionaries show up mid-stream. As an example, a service streaming data from Parquet files might send dictionary-encoded versions of some columns, and it would not be practical to have to scan all of the Parquet files of interest to find the global dictionary. The Apache CarbonData format built some Spark-based infrastructure around this exact problem, but we cannot assume that it will be cheap or practical to find the global dictionary up front.
>> >
>> > I think having dictionary messages occur after the first record batches is a reasonable strategy. I would suggest we add a "type" field to the DictionaryBatch message type ([1]) so that we can indicate either that the message is a NEW dictionary (i.e. the existing one should be dropped) or a DELTA (additions) to an existing dictionary. I don't think it will be difficult to accommodate this in the C++ implementation, for example (though we will need to finally implement "concatenate" for all supported types to make it work).
>> >
>> > Thanks,
>> > Wes
>> >
>> > [1]: https://github.com/apache/arrow/blob/master/format/Message.fbs#L86
>> >
>> > On Tue, Oct 24, 2017 at 3:44 PM, Brian Hulette <[email protected]> wrote:
>> >> One issue we've struggled with when adding an Arrow interface to Geomesa is the requirement to send all dictionary batches before record batches in the IPC formats. Sometimes we have pre-computed "top-k" stats that we can use to assemble a dictionary beforehand, but those don't always exist, and even when they do they aren't complete by definition, so we could end up hiding valuable data in an "Other" category. So in practice we often have to wait to collect all the data before we can start streaming anything.
>> >>
>> >> I'd like to propose a couple of modifications to the Arrow IPC formats that could help alleviate this problem:
>> >> 1) Allow multiple dictionary batches to use the same id. The vectors in all dictionary batches with the same id can be concatenated together to represent the full dictionary with that id.
>> >> 2) Allow dictionary batches and record batches to be interleaved. For the streaming format, there could be an additional requirement that any dictionary key used in a record batch must have been defined in a previously sent dictionary batch.
>> >>
>> >> These changes would allow producers to send "delta" dictionary batches in an Arrow stream to define new keys that will be used in future record batches.
>> >> Here's an example stream with one column of city names, to help illustrate the idea:
>> >>
>> >> <SCHEMA>
>> >> <DICTIONARY id=0>
>> >> (0) "New York"
>> >> (1) "Seattle"
>> >> (2) "Washington, DC"
>> >>
>> >> <RECORD BATCH 0>
>> >> 0
>> >> 1
>> >> 2
>> >> 1
>> >>
>> >> <DICTIONARY id=0>
>> >> (3) "Chicago"
>> >> (4) "San Francisco"
>> >>
>> >> <RECORD BATCH 1>
>> >> 3
>> >> 2
>> >> 4
>> >> 0
>> >> EOS
>> >>
>> >> Decoded Data:
>> >> -------------
>> >> New York
>> >> Seattle
>> >> Washington, DC
>> >> Seattle
>> >> Chicago
>> >> Washington, DC
>> >> San Francisco
>> >> New York
>> >>
>> >> I also think it can be valuable if the requirement mentioned in #2 applies only to the streaming format, so that the random-access format would support dictionary batches following record batches. That way producers creating random-access files could start writing record batches before all the data for the dictionaries has been assembled.
>> >>
>> >> I need to give Paul Taylor credit for this idea - he actually already wrote the JS arrow reader to combine dictionaries with the same id (https://github.com/apache/arrow/blob/master/js/src/reader/arrow.ts#L59), and it occurred to me that that could be a solution for us.
>> >>
>> >> Thanks
>> >> Brian
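Brian's city-name example maps directly onto the sketch above. Replaying his stream with the hypothetical applyDictionaryBatch/dictionaries helpers reproduces his decoded output (again, illustrative TypeScript only, not the actual reader API):

// Replaying Brian's example stream with the hypothetical helpers above.
// Both dictionary batches use id 0; the second one is a DELTA.

applyDictionaryBatch({
  id: 0,
  type: DictionaryBatchType.NEW,
  values: ['New York', 'Seattle', 'Washington, DC'],
});
const batch0 = [0, 1, 2, 1];

applyDictionaryBatch({
  id: 0,
  type: DictionaryBatchType.DELTA,
  values: ['Chicago', 'San Francisco'],
});
const batch1 = [3, 2, 4, 0];

// Decode all keys against the accumulated dictionary for id 0. This is
// safe because a DELTA only appends: keys used in batch 0 still map to
// the same values in the grown dictionary.
const dict = dictionaries.get(0)!;
const decoded = [...batch0, ...batch1].map((key) => dict[key]);
console.log(decoded.join('\n'));
// New York, Seattle, Washington, DC, Seattle,
// Chicago, Washington, DC, San Francisco, New York

Because a DELTA only appends, keys assigned before the delta keep their meaning, which is what lets record batch 0 decode correctly even against the final, grown dictionary.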
