One issue we've struggled with when adding an Arrow interface to GeoMesa
is the requirement, in the IPC formats, that all dictionary batches be
sent before any record batches. Sometimes we have pre-computed "top-k"
stats that we can use to assemble a dictionary up front, but those don't
always exist, and even when they do they are by definition incomplete, so
we could end up hiding valuable data in an "Other" category. In practice,
we often have to wait and collect all the data before we can start
streaming anything.
I'd like to propose a couple of modifications to the Arrow IPC formats
that could help alleviate this problem:
1) Allow multiple dictionary batches to use the same id. The vectors in
all dictionary batches with the same id would be concatenated, in order,
to form the full dictionary with that id.
2) Allow dictionary batches and record batches to be interleaved. For
the streaming format, there could be an additional requirement that any
dictionary key used in a record batch must have been defined in a
previously sent dictionary batch.
These changes would allow producers to send "delta" dictionary batches
in an Arrow stream to define new keys that will be used in future record
batches. Here's an example stream with one column of city names, to help
illustrate the idea:
<SCHEMA>
<DICTIONARY id=0>
(0) "New York"
(1) "Seattle"
(2) "Washington, DC"
<RECORD BATCH 0>
0
1
2
1
<DICTIONARY id=0>
(3) "Chicago"
(4) "San Francisco"
<RECORD BATCH 1>
3
2
4
0
EOS
Decoded Data:
-------------
New York
Seattle
Washington, DC
Seattle
Chicago
Washington, DC
San Francisco
New York
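The reader-side logic for a stream like this is straightforward to
sketch. Here is a toy model in plain Python (not the pyarrow API; the
message-tuple layout is invented purely for illustration) that applies
both proposed rules, concatenating same-id dictionary batches and
decoding interleaved record batches:

```python
from typing import Dict, List, Tuple

def decode_stream(messages: List[Tuple[str, int, list]]) -> List[str]:
    """Decode a stream of ("dictionary", id, values) and
    ("record", id, keys) messages into the underlying values."""
    dictionaries: Dict[int, List[str]] = {}
    decoded: List[str] = []
    for kind, dict_id, payload in messages:
        if kind == "dictionary":
            # Rule 1: batches sharing an id are concatenated in order,
            # so later batches act as "deltas" defining new keys.
            dictionaries.setdefault(dict_id, []).extend(payload)
        else:
            # Rule 2: every key must already have been defined by an
            # earlier dictionary batch in the stream.
            decoded.extend(dictionaries[dict_id][key] for key in payload)
    return decoded

# The example stream from above:
stream = [
    ("dictionary", 0, ["New York", "Seattle", "Washington, DC"]),
    ("record", 0, [0, 1, 2, 1]),
    ("dictionary", 0, ["Chicago", "San Francisco"]),
    ("record", 0, [3, 2, 4, 0]),
]
```

Running decode_stream(stream) reproduces the decoded data shown above.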
I also think it would be valuable for the requirement mentioned in #2 to
apply only to the streaming format, so that the random-access format
could support dictionary batches that follow record batches. That way
producers creating random-access files could start writing record
batches before all the data for the dictionaries has been assembled.
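To make that concrete, here is a toy sketch of such a writer (all names
here are invented for illustration, not an existing Arrow API). It
dictionary-encodes record batches as they arrive and emits the complete
dictionary only once, at the end; in a real file the encoded record
batches would be flushed to disk immediately and their offsets recorded
in the footer, rather than buffered in memory as done here:

```python
from typing import Dict, List, Tuple

class RandomAccessWriter:
    """Toy model: record batches first, complete dictionary last."""

    def __init__(self) -> None:
        self.dictionary: List[str] = []   # grows as new values arrive
        self.key_of: Dict[str, int] = {}
        self.record_batches: List[List[int]] = []

    def write_batch(self, values: List[str]) -> None:
        """Dictionary-encode a batch; new values extend the dictionary."""
        keys = []
        for v in values:
            if v not in self.key_of:
                self.key_of[v] = len(self.dictionary)
                self.dictionary.append(v)
            keys.append(self.key_of[v])
        self.record_batches.append(keys)

    def finish(self) -> List[Tuple[str, int, list]]:
        """Lay out the file body with the dictionary batch at the end.
        A reader consults the footer, loads the dictionary first, then
        reads any record batch in any order."""
        body = [("record", 0, keys) for keys in self.record_batches]
        body.append(("dictionary", 0, self.dictionary))
        return body
```

The key point is that write_batch never has to wait: the dictionary is
assembled as a side effect of encoding, and only finish() needs it
complete.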
I need to give Paul Taylor credit for this idea - he has actually
already written the JS Arrow reader to combine dictionaries with the
same id
(https://github.com/apache/arrow/blob/master/js/src/reader/arrow.ts#L59),
and it occurred to me that this could be a solution for us.
Thanks
Brian