Hi Brian,

Thanks for bringing this up. I'm +1 on having a mechanism that allows dictionaries to grow or change mid-stream. I figured this would come up eventually, and the current design of the stream format does not preclude dictionaries showing up mid-stream. As an example, a service streaming data from Parquet files might send dictionary-encoded versions of some columns, and it would not be practical to scan all of the Parquet files of interest up front to compute a global dictionary. The Apache CarbonData format built Spark-based infrastructure around this exact problem, but we cannot assume that finding the global dictionary ahead of time will be cheap or practical.
I think having dictionary messages occur after the first record batches is a reasonable strategy. I would suggest we add a "type" field to the DictionaryBatch message type ([1]) so that a message can indicate either a NEW dictionary (i.e. any existing dictionary with that id should be dropped) or a DELTA (additions to be appended to an existing dictionary). I don't think it will be difficult to accommodate this in the C++ implementation, for example (though we will need to finally implement "concatenate" for all supported types to make it work). Rough sketches of how a consumer might apply deltas, and how a producer might emit them, are included below the quoted message.

Thanks,
Wes

[1]: https://github.com/apache/arrow/blob/master/format/Message.fbs#L86

On Tue, Oct 24, 2017 at 3:44 PM, Brian Hulette <[email protected]> wrote:
> One issue we've struggled with when adding an Arrow interface to Geomesa is
> the requirement to send all dictionary batches before record batches in the
> IPC formats. Sometimes we have pre-computed "top-k" stats that we can use to
> assemble a dictionary beforehand, but those don't always exist, and even
> when they do they aren't complete by definition, so we could end up hiding
> valuable data in an "Other" category. So in practice we often have to wait
> to collect all the data before we can start streaming anything.
>
> I'd like to propose a couple of modifications to the Arrow IPC formats that
> could help alleviate this problem:
> 1) Allow multiple dictionary batches to use the same id. The vectors in all
> dictionary batches with the same id can be concatenated together to
> represent the full dictionary with that id.
> 2) Allow dictionary batches and record batches to be interleaved. For the
> streaming format, there could be an additional requirement that any
> dictionary key used in a record batch must have been defined in a previously
> sent dictionary batch.
>
> These changes would allow producers to send "delta" dictionary batches in an
> Arrow stream to define new keys that will be used in future record batches.
> Here's an example stream with one column of city names, to help illustrate
> the idea:
>
> <SCHEMA>
> <DICTIONARY id=0>
> (0) "New York"
> (1) "Seattle"
> (2) "Washington, DC"
>
> <RECORD BATCH 0>
> 0
> 1
> 2
> 1
>
> <DICTIONARY id=0>
> (3) "Chicago"
> (4) "San Francisco"
>
> <RECORD BATCH 1>
> 3
> 2
> 4
> 0
> EOS
>
>
> Decoded Data:
> -------------
> New York
> Seattle
> Washington, DC
> Seattle
> Chicago
> Washington, DC
> San Francisco
> New York
>
>
> I also think it can be valuable if the requirement mentioned in #2 applies
> only to the streaming format, so that the random-access format would support
> dictionary batches following record batches. That way producers creating
> random-access files could start writing record batches before all the data
> for the dictionaries has been assembled.
>
> I need to give Paul Taylor credit for this idea - he actually already wrote
> the JS arrow reader to combine dictionaries with the same id
> (https://github.com/apache/arrow/blob/master/js/src/reader/arrow.ts#L59),
> and it occurred to me that that could be a solution for us.
>
> Thanks
> Brian
>
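To make the delta semantics concrete, here is a minimal consumer-side sketch in plain Python. It does not use any Arrow APIs, and the is_delta flag and message tuples are hypothetical stand-ins for the proposed "type" field on DictionaryBatch; it simply replays the city-names stream above, concatenating each delta onto the existing dictionary before decoding the record batches that follow:

# Minimal sketch of consumer-side delta handling, assuming a hypothetical
# is_delta flag per dictionary message (a stand-in for the proposed "type"
# field on DictionaryBatch). Plain Python only; no Arrow APIs are used.

# Each message is (kind, payload):
#   ("dictionary", (dict_id, values, is_delta))  -- dictionary batch
#   ("record",     (dict_id, keys))              -- record batch of dictionary keys
stream = [
    ("dictionary", (0, ["New York", "Seattle", "Washington, DC"], False)),
    ("record", (0, [0, 1, 2, 1])),
    ("dictionary", (0, ["Chicago", "San Francisco"], True)),  # delta: appended
    ("record", (0, [3, 2, 4, 0])),
]

dictionaries = {}  # dict_id -> list of values received so far
decoded = []

for kind, payload in stream:
    if kind == "dictionary":
        dict_id, values, is_delta = payload
        if is_delta and dict_id in dictionaries:
            dictionaries[dict_id].extend(values)  # concatenate onto the existing dictionary
        else:
            dictionaries[dict_id] = list(values)  # NEW: replace any prior dictionary
    else:
        dict_id, keys = payload
        # Streaming-format rule: every key must already be defined at this point.
        decoded.extend(dictionaries[dict_id][k] for k in keys)

print(decoded)
# ['New York', 'Seattle', 'Washington, DC', 'Seattle',
#  'Chicago', 'Washington, DC', 'San Francisco', 'New York']

This reproduces the "Decoded Data" listing from the example stream.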
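On the producer side, a writer that cannot assemble the full dictionary up front could keep an index of the values it has already sent and emit a dictionary batch containing only the new entries ahead of each record batch, which satisfies the streaming-format rule that keys be defined before use. Again, this is only an illustrative sketch with made-up message structures, not the actual Arrow IPC writer:

# Minimal sketch of producer-side delta emission: dictionary-encode each chunk
# as it arrives and send a dictionary batch containing only the entries that
# have not been sent yet, ahead of the record batch that uses them. The
# message tuples and is_delta flag are illustrative, not actual Arrow IPC
# structures.

def encode_stream(chunks, dict_id=0):
    index = {}  # value -> dictionary key, for everything sent so far
    messages = []
    for chunk in chunks:
        new_values = []
        keys = []
        for value in chunk:
            if value not in index:
                index[value] = len(index)
                new_values.append(value)
            keys.append(index[value])
        if new_values:
            # The first dictionary batch is a full (NEW) one; later ones are deltas.
            is_delta = len(index) > len(new_values)
            messages.append(("dictionary", (dict_id, new_values, is_delta)))
        messages.append(("record", (dict_id, keys)))
    return messages

chunks = [
    ["New York", "Seattle", "Washington, DC", "Seattle"],
    ["Chicago", "Washington, DC", "San Francisco", "New York"],
]
for message in encode_stream(chunks):
    print(message)
# ('dictionary', (0, ['New York', 'Seattle', 'Washington, DC'], False))
# ('record', (0, [0, 1, 2, 1]))
# ('dictionary', (0, ['Chicago', 'San Francisco'], True))
# ('record', (0, [3, 2, 4, 0]))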
