Is the proposal only to append to the dictionary, or to redefine it?
On Wed, Oct 25, 2017 at 7:16 AM, Wes McKinney <[email protected]> wrote:
> Opened https://issues.apache.org/jira/browse/ARROW-1727
>
> On Tue, Oct 24, 2017 at 6:16 PM, Wes McKinney <[email protected]> wrote:
> > hi Brian,
> >
> > Thanks for bringing this up. I'm +1 on having a mechanism to enable
> > dictionaries to grow or change mid-stream. I figured that this would
> > eventually come up, and the current design of the stream does not
> > preclude having dictionaries show up mid-stream. As an example, a
> > service streaming data from Parquet files might send
> > dictionary-encoded versions of some columns, and it would not be
> > practical to have to scan all of the Parquet files of interest to
> > find the global dictionary. The Apache CarbonData format built some
> > Spark-based infrastructure around this exact problem, but we cannot
> > assume that it will be cheap or practical to find the global
> > dictionary up front.
> >
> > I think having dictionary messages occur after the first record
> > batches is a reasonable strategy. I would suggest we add a "type"
> > field to the DictionaryBatch message type ([1]) so that we can
> > indicate either that the message is a NEW dictionary (i.e., the
> > existing one should be dropped) or a DELTA (additions to an existing
> > dictionary). I don't think it will be difficult to accommodate this
> > in the C++ implementation, for example (though we will need to
> > finally implement "concatenate" for all supported types to make it
> > work).
> >
> > Thanks,
> > Wes
> >
> > [1]: https://github.com/apache/arrow/blob/master/format/Message.fbs#L86
> >
> > On Tue, Oct 24, 2017 at 3:44 PM, Brian Hulette <[email protected]> wrote:
> >> One issue we've struggled with when adding an Arrow interface to
> >> GeoMesa is the requirement to send all dictionary batches before
> >> record batches in the IPC formats. Sometimes we have pre-computed
> >> "top-k" stats that we can use to assemble a dictionary beforehand,
> >> but those don't always exist, and even when they do they aren't
> >> complete by definition, so we could end up hiding valuable data in
> >> an "Other" category. So in practice we often have to wait to
> >> collect all the data before we can start streaming anything.
> >>
> >> I'd like to propose a couple of modifications to the Arrow IPC
> >> formats that could help alleviate this problem:
> >> 1) Allow multiple dictionary batches to use the same id. The
> >> vectors in all dictionary batches with the same id can be
> >> concatenated together to represent the full dictionary with that
> >> id.
> >> 2) Allow dictionary batches and record batches to be interleaved.
> >> For the streaming format, there could be an additional requirement
> >> that any dictionary key used in a record batch must have been
> >> defined in a previously sent dictionary batch.
> >>
> >> These changes would allow producers to send "delta" dictionary
> >> batches in an Arrow stream to define new keys that will be used in
> >> future record batches.
> >> Here's an example stream with one column of city names, to help
> >> illustrate the idea:
> >>
> >> <SCHEMA>
> >> <DICTIONARY id=0>
> >> (0) "New York"
> >> (1) "Seattle"
> >> (2) "Washington, DC"
> >>
> >> <RECORD BATCH 0>
> >> 0
> >> 1
> >> 2
> >> 1
> >>
> >> <DICTIONARY id=0>
> >> (3) "Chicago"
> >> (4) "San Francisco"
> >>
> >> <RECORD BATCH 1>
> >> 3
> >> 2
> >> 4
> >> 0
> >> EOS
> >>
> >> Decoded Data:
> >> -------------
> >> New York
> >> Seattle
> >> Washington, DC
> >> Seattle
> >> Chicago
> >> Washington, DC
> >> San Francisco
> >> New York
> >>
> >> I also think it would be valuable if the requirement mentioned in
> >> #2 applied only to the streaming format, so that the random-access
> >> format would support dictionary batches following record batches.
> >> That way producers creating random-access files could start
> >> writing record batches before all the data for the dictionaries
> >> has been assembled.
> >>
> >> I need to give Paul Taylor credit for this idea - he actually
> >> already wrote the JS Arrow reader to combine dictionaries with the
> >> same id
> >> (https://github.com/apache/arrow/blob/master/js/src/reader/arrow.ts#L59),
> >> and it occurred to me that that could be a solution for us.
> >>
> >> Thanks,
> >> Brian
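To make the proposed semantics concrete, here is a minimal, library-free Python sketch of the consumer-side bookkeeping the thread describes: a delta batch concatenates new values onto the existing dictionary vector, while a replacement (NEW) batch drops it, per Wes's suggested "type" field. All names below (DictionaryMessage, is_delta, apply_dictionary, decode_batch) are hypothetical illustrations, not part of the Arrow format or any real API.

# Hypothetical sketch of consumer-side handling of delta dictionaries;
# names are illustrative only, not the Arrow spec or an implementation.

class DictionaryMessage:
    def __init__(self, dict_id, values, is_delta=False):
        self.dict_id = dict_id    # dictionary id shared across batches
        self.values = values      # dictionary values in this batch
        self.is_delta = is_delta  # DELTA appends; NEW (False) replaces

def apply_dictionary(dictionaries, msg):
    """Update the id -> values mapping from a dictionary message."""
    if msg.is_delta and msg.dict_id in dictionaries:
        # Delta batch: concatenate onto the existing vector, so keys
        # issued by earlier record batches stay valid.
        dictionaries[msg.dict_id] = dictionaries[msg.dict_id] + list(msg.values)
    else:
        # NEW (replacement) batch: drop the existing dictionary.
        dictionaries[msg.dict_id] = list(msg.values)

def decode_batch(dictionaries, dict_id, keys):
    """Decode a record batch of dictionary keys into values."""
    values = dictionaries[dict_id]
    return [values[k] for k in keys]

# Replaying Brian's example stream:
dictionaries = {}
apply_dictionary(dictionaries, DictionaryMessage(
    0, ["New York", "Seattle", "Washington, DC"]))
decoded = decode_batch(dictionaries, 0, [0, 1, 2, 1])
apply_dictionary(dictionaries, DictionaryMessage(
    0, ["Chicago", "San Francisco"], is_delta=True))
decoded += decode_batch(dictionaries, 0, [3, 2, 4, 0])
# decoded == ["New York", "Seattle", "Washington, DC", "Seattle",
#             "Chicago", "Washington, DC", "San Francisco", "New York"]

Replaying the example stream through this sketch yields exactly the "Decoded Data" listing above; a consumer enforcing the streaming-format requirement in Brian's point 2) would simply treat a key beyond the current dictionary length as an error.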
