One issue we've struggled with when adding an Arrow interface to GeoMesa
is the requirement, in the IPC formats, that all dictionary batches be
sent before any record batches. Sometimes we have pre-computed "top-k"
stats that we can use to assemble a dictionary up front, but those don't
always exist, and even when they do they are by definition incomplete, so
we could end up hiding valuable data in an "Other" category. In practice,
we often have to wait and collect all the data before we can start
streaming anything.
I'd like to propose a couple of modifications to the Arrow IPC formats
that could help alleviate this problem:
1) Allow multiple dictionary batches to use the same id. The vectors in
all dictionary batches with the same id would be concatenated, in order,
to form the full dictionary with that id.
2) Allow dictionary batches and record batches to be interleaved. For
the streaming format, there could be an additional requirement that any
dictionary key used in a record batch must have been defined in a
previously sent dictionary batch.
These changes would allow producers to send "delta" dictionary batches
in an Arrow stream to define new keys that will be used in future record
batches. Here's an example stream with one column of city names, to help
illustrate the idea:
<SCHEMA>
<DICTIONARY id=0>
(0) "New York"
(1) "Seattle"
(2) "Washington, DC"
<RECORD BATCH 0>
0
1
2
1
<DICTIONARY id=0>
(3) "Chicago"
(4) "San Francisco"
<RECORD BATCH 1>
3
2
4
0
EOS
Decoded Data:
-------------
New York
Seattle
Washington, DC
Seattle
Chicago
Washington, DC
San Francisco
New York
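The reader-side logic for a stream like this is straightforward to
sketch. Here is a toy model in plain Python (not the pyarrow API; the
message-tuple layout is invented purely for illustration) that applies
both proposed rules, concatenating same-id dictionary batches and
decoding interleaved record batches:

```python
from typing import Dict, List, Tuple

def decode_stream(messages: List[Tuple[str, int, list]]) -> List[str]:
    """Decode a stream of ("dictionary", id, values) and
    ("record", id, keys) messages into the underlying values."""
    dictionaries: Dict[int, List[str]] = {}
    decoded: List[str] = []
    for kind, dict_id, payload in messages:
        if kind == "dictionary":
            # Rule 1: batches sharing an id are concatenated in order,
            # so later batches act as "deltas" defining new keys.
            dictionaries.setdefault(dict_id, []).extend(payload)
        else:
            # Rule 2: every key must already have been defined by an
            # earlier dictionary batch in the stream.
            decoded.extend(dictionaries[dict_id][key] for key in payload)
    return decoded

# The example stream from above:
stream = [
    ("dictionary", 0, ["New York", "Seattle", "Washington, DC"]),
    ("record", 0, [0, 1, 2, 1]),
    ("dictionary", 0, ["Chicago", "San Francisco"]),
    ("record", 0, [3, 2, 4, 0]),
]
```

Running decode_stream(stream) reproduces the decoded data shown above.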
I also think it would be valuable for the requirement mentioned in #2 to
apply only to the streaming format, so that the random-access format
could support dictionary batches that follow record batches. That way
producers creating random-access files could start writing record
batches before all the data for the dictionaries has been assembled.
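To make that concrete, here is a toy sketch of such a writer (all names
here are invented for illustration, not an existing Arrow API). It
dictionary-encodes record batches as they arrive and emits the complete
dictionary only once, at the end; in a real file the encoded record
batches would be flushed to disk immediately and their offsets recorded
in the footer, rather than buffered in memory as done here:

```python
from typing import Dict, List, Tuple

class RandomAccessWriter:
    """Toy model: record batches first, complete dictionary last."""

    def __init__(self) -> None:
        self.dictionary: List[str] = []   # grows as new values arrive
        self.key_of: Dict[str, int] = {}
        self.record_batches: List[List[int]] = []

    def write_batch(self, values: List[str]) -> None:
        """Dictionary-encode a batch; new values extend the dictionary."""
        keys = []
        for v in values:
            if v not in self.key_of:
                self.key_of[v] = len(self.dictionary)
                self.dictionary.append(v)
            keys.append(self.key_of[v])
        self.record_batches.append(keys)

    def finish(self) -> List[Tuple[str, int, list]]:
        """Lay out the file body with the dictionary batch at the end.
        A reader consults the footer, loads the dictionary first, then
        reads any record batch in any order."""
        body = [("record", 0, keys) for keys in self.record_batches]
        body.append(("dictionary", 0, self.dictionary))
        return body
```

The key point is that write_batch never has to wait: the dictionary is
assembled as a side effect of encoding, and only finish() needs it
complete.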
I need to give Paul Taylor credit for this idea - he has actually
already written the JS Arrow reader to combine dictionaries with the
same id
(https://github.com/apache/arrow/blob/master/js/src/reader/arrow.ts#L59),
and it occurred to me that this could be a solution for us.
Thanks
Brian