Is the proposal only to append to the dictionary, or to redefine it?
On Wed, Oct 25, 2017 at 7:16 AM, Wes McKinney <[email protected]> wrote:
> Opened https://issues.apache.org/jira/browse/ARROW-1727
>
> On Tue, Oct 24, 2017 at 6:16 PM, Wes McKinney <[email protected]> wrote:
> > hi Brian,
> >
> > Thanks for bringing this up. I'm +1 on having a mechanism to enable
> > dictionaries to grow or change mid-stream. I figured that this would
> > eventually come up, and the current design of the stream does not
> > preclude having dictionaries show up mid-stream. As an example, a
> > service streaming data from Parquet files might send
> > dictionary-encoded versions of some columns, and it would not be
> > practical to have to scan all of the Parquet files of interest to
> > find the global dictionary. The Apache CarbonData format built some
> > Spark-based infrastructure around this exact problem, but we cannot
> > assume that it will be cheap or practical to find the global
> > dictionary up front.
> >
> > I think having dictionary messages occur after the first record
> > batches is a reasonable strategy. I would suggest we add a "type"
> > field to the DictionaryBatch message type ([1]) so that we can
> > indicate either that the message is a NEW dictionary (i.e., the
> > existing one should be dropped) or a DELTA (additions to an existing
> > dictionary). I don't think it will be difficult to accommodate this
> > in the C++ implementation, for example (though we will need to
> > finally implement "concatenate" for all supported types to make it
> > work).
> >
> > Thanks,
> > Wes
> >
> > [1]: https://github.com/apache/arrow/blob/master/format/Message.fbs#L86
> >
> > On Tue, Oct 24, 2017 at 3:44 PM, Brian Hulette <[email protected]> wrote:
> >> One issue we've struggled with when adding an Arrow interface to
> >> GeoMesa is the requirement to send all dictionary batches before
> >> record batches in the IPC formats. Sometimes we have pre-computed
> >> "top-k" stats that we can use to assemble a dictionary beforehand,
> >> but those don't always exist, and even when they do they aren't
> >> complete by definition, so we could end up hiding valuable data in
> >> an "Other" category. So in practice we often have to wait to
> >> collect all the data before we can start streaming anything.
> >>
> >> I'd like to propose a couple of modifications to the Arrow IPC
> >> formats that could help alleviate this problem:
> >> 1) Allow multiple dictionary batches to use the same id. The
> >> vectors in all dictionary batches with the same id can be
> >> concatenated together to represent the full dictionary with that
> >> id.
> >> 2) Allow dictionary batches and record batches to be interleaved.
> >> For the streaming format, there could be an additional requirement
> >> that any dictionary key used in a record batch must have been
> >> defined in a previously sent dictionary batch.
> >>
> >> These changes would allow producers to send "delta" dictionary
> >> batches in an Arrow stream to define new keys that will be used in
> >> future record batches.
> >> Here's an example stream with one column of city names, to help
> >> illustrate the idea:
> >>
> >> <SCHEMA>
> >> <DICTIONARY id=0>
> >> (0) "New York"
> >> (1) "Seattle"
> >> (2) "Washington, DC"
> >>
> >> <RECORD BATCH 0>
> >> 0
> >> 1
> >> 2
> >> 1
> >>
> >> <DICTIONARY id=0>
> >> (3) "Chicago"
> >> (4) "San Francisco"
> >>
> >> <RECORD BATCH 1>
> >> 3
> >> 2
> >> 4
> >> 0
> >> EOS
> >>
> >> Decoded Data:
> >> -------------
> >> New York
> >> Seattle
> >> Washington, DC
> >> Seattle
> >> Chicago
> >> Washington, DC
> >> San Francisco
> >> New York
> >>
> >> I also think it would be valuable if the requirement mentioned in
> >> #2 applied only to the streaming format, so that the random-access
> >> format would support dictionary batches following record batches.
> >> That way producers creating random-access files could start
> >> writing record batches before all the data for the dictionaries
> >> has been assembled.
> >>
> >> I need to give Paul Taylor credit for this idea - he actually
> >> already wrote the JS Arrow reader to combine dictionaries with the
> >> same id
> >> (https://github.com/apache/arrow/blob/master/js/src/reader/arrow.ts#L59),
> >> and it occurred to me that that could be a solution for us.
> >>
> >> Thanks,
> >> Brian
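To make the proposed semantics concrete, here is a minimal, library-free Python sketch of the consumer-side bookkeeping the thread describes: a delta batch concatenates new values onto the existing dictionary vector, while a replacement (NEW) batch drops it, per Wes's suggested "type" field. All names below (DictionaryMessage, is_delta, apply_dictionary, decode_batch) are hypothetical illustrations, not part of the Arrow format or any real API.

# Hypothetical sketch of consumer-side handling of delta dictionaries;
# names are illustrative only, not the Arrow spec or an implementation.

class DictionaryMessage:
    def __init__(self, dict_id, values, is_delta=False):
        self.dict_id = dict_id    # dictionary id shared across batches
        self.values = values      # dictionary values in this batch
        self.is_delta = is_delta  # DELTA appends; NEW (False) replaces

def apply_dictionary(dictionaries, msg):
    """Update the id -> values mapping from a dictionary message."""
    if msg.is_delta and msg.dict_id in dictionaries:
        # Delta batch: concatenate onto the existing vector, so keys
        # issued by earlier record batches stay valid.
        dictionaries[msg.dict_id] = dictionaries[msg.dict_id] + list(msg.values)
    else:
        # NEW (replacement) batch: drop the existing dictionary.
        dictionaries[msg.dict_id] = list(msg.values)

def decode_batch(dictionaries, dict_id, keys):
    """Decode a record batch of dictionary keys into values."""
    values = dictionaries[dict_id]
    return [values[k] for k in keys]

# Replaying Brian's example stream:
dictionaries = {}
apply_dictionary(dictionaries, DictionaryMessage(
    0, ["New York", "Seattle", "Washington, DC"]))
decoded = decode_batch(dictionaries, 0, [0, 1, 2, 1])
apply_dictionary(dictionaries, DictionaryMessage(
    0, ["Chicago", "San Francisco"], is_delta=True))
decoded += decode_batch(dictionaries, 0, [3, 2, 4, 0])
# decoded == ["New York", "Seattle", "Washington, DC", "Seattle",
#             "Chicago", "Washington, DC", "San Francisco", "New York"]

Replaying the example stream through this sketch yields exactly the "Decoded Data" listing above; a consumer enforcing the streaming-format requirement in Brian's point 2) would simply treat a key beyond the current dictionary length as an error.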
