Dimitri Vorona created ARROW-2176:
-------------------------------------

             Summary: [C++] Extend DictionaryBuilder to support delta dictionaries
                 Key: ARROW-2176
                 URL: https://issues.apache.org/jira/browse/ARROW-2176
             Project: Apache Arrow
          Issue Type: New Feature
          Components: C++
            Reporter: Dimitri Vorona
             Fix For: 0.9.0


[The IPC format|https://arrow.apache.org/docs/ipc.html] allows sending 
additional dictionary batches with a previously seen id and an isDelta flag to 
extend the existing dictionaries with new entries. Right now, the 
DictionaryBuilder (as well as the IPC writer and reader) does not support 
generating delta dictionaries.
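
To make the isDelta semantics concrete, here is a minimal reader-side sketch. It is not Arrow's API: ToyDictionaryMemo and ApplyDictionaryBatch are hypothetical names, strings stand in for dictionary values, and treating a non-delta batch as "install or replace" is an assumption of this sketch rather than something stated above.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical reader-side store (not Arrow's API): dictionary id -> entries.
using ToyDictionaryMemo =
    std::unordered_map<int64_t, std::vector<std::string>>;

// Applies one dictionary batch to the memo.
// Returns false if a delta batch refers to an id that was never seen.
bool ApplyDictionaryBatch(ToyDictionaryMemo& memo, int64_t id,
                          const std::vector<std::string>& entries,
                          bool is_delta) {
  if (!is_delta) {
    // Assumption of this sketch: a full batch installs or replaces.
    memo[id] = entries;
    return true;
  }
  auto it = memo.find(id);
  if (it == memo.end()) {
    return false;  // nothing to extend
  }
  // A delta batch appends its entries after the existing ones, so
  // previously handed-out indices stay valid.
  it->second.insert(it->second.end(), entries.begin(), entries.end());
  return true;
}
```

Because delta entries are only appended, indices decoded against earlier batches keep their meaning.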

This pull request contains a basic implementation of the DictionaryBuilder with 
delta dictionary support. The API usage can be seen in the dictionary tests 
(e.g. 
[here|https://github.com/alendit/arrow/blob/delta_dictionary_builder/cpp/src/arrow/array-test.cc#L1773]).
 The basic idea is that the user just reuses the builder object after calling 
Finish(Array*) for the first time. Subsequent calls to Append will create new 
entries only for the unseen elements and reuse the ids from previous 
dictionaries for the seen ones.
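
The reuse pattern can be sketched with a toy builder. This is an illustration only, not the code in the PR: ToyDeltaDictionaryBuilder and ToyDictionaryBatch are hypothetical names, values are plain strings, and Finish() returns a struct instead of filling an Array.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical toy type (not part of Arrow): the result of one Finish().
struct ToyDictionaryBatch {
  std::vector<std::string> dictionary;  // entries new since the last Finish()
  std::vector<int32_t> indices;         // ids into all dictionaries so far
  bool is_delta;                        // false for the first batch only
};

class ToyDeltaDictionaryBuilder {
 public:
  void Append(const std::string& value) {
    auto it = ids_.find(value);
    if (it == ids_.end()) {
      // Unseen value: assign the next id and queue it for the next batch.
      assert(ids_.size() <
             static_cast<std::size_t>(std::numeric_limits<int32_t>::max()));
      const int32_t id = static_cast<int32_t>(ids_.size());
      it = ids_.emplace(value, id).first;
      pending_.push_back(value);
    }
    // Seen values reuse the id assigned in an earlier batch.
    indices_.push_back(it->second);
  }

  // Emits the pending entries and indices; the id table is kept so the
  // builder can be reused for the next (delta) batch.
  ToyDictionaryBatch Finish() {
    ToyDictionaryBatch out{std::move(pending_), std::move(indices_),
                           finished_once_};
    pending_.clear();
    indices_.clear();
    finished_once_ = true;
    return out;
  }

 private:
  std::unordered_map<std::string, int32_t> ids_;
  std::vector<std::string> pending_;
  std::vector<int32_t> indices_;
  bool finished_once_ = false;
};
```

Appending "a", "b", "a" and finishing yields the dictionary {"a", "b"}; appending "b", "c" afterwards and finishing again yields a delta dictionary containing only "c", with indices that still refer to the earlier id of "b".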

Some considerations:
 # The API is pretty implicit. An additional flag for Finish that explicitly 
indicates a desire to use the builder for delta dictionary generation might 
help avoid errors.
 # Right now the implementation uses an additional "overflow dictionary" to 
store the seen items. This adds a copy on each Finish call and an additional 
lookup at each GetItem or Append call. I assume we might get away with 
returning Array slices at Finish, which would remove the need for an additional 
overflow dictionary. If the gist of the PR is approved, I can look into further 
optimizations.
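
A rough sketch of the slice idea from the second point, under simplifying assumptions: ToySlicingDictionaryBuilder and ToySlice are hypothetical names, values live in a single std::vector, and a "slice" is just an (offset, length) pair into that vector. With real Arrow buffers, growth may reallocate the underlying memory, so this only illustrates the bookkeeping.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for an Array slice: a range of the value storage.
struct ToySlice {
  std::size_t offset;
  std::size_t length;
};

class ToySlicingDictionaryBuilder {
 public:
  // Returns the id of `value`, adding it to the storage if unseen.
  // Only one hash table is consulted, so there is one lookup per call.
  int32_t Append(const std::string& value) {
    auto it = ids_.find(value);
    if (it != ids_.end()) {
      return it->second;
    }
    assert(values_.size() <
           static_cast<std::size_t>(std::numeric_limits<int32_t>::max()));
    const int32_t id = static_cast<int32_t>(values_.size());
    values_.push_back(value);
    ids_.emplace(value, id);
    return id;
  }

  // Returns the range of entries added since the previous Finish().
  // No values are copied; only the start marker moves.
  ToySlice Finish() {
    ToySlice delta{delta_start_, values_.size() - delta_start_};
    delta_start_ = values_.size();
    return delta;
  }

  const std::vector<std::string>& values() const { return values_; }

 private:
  std::unordered_map<std::string, int32_t> ids_;
  std::vector<std::string> values_;
  std::size_t delta_start_ = 0;
};
```

Here the full dictionary is always values(), and each Finish() hands back the tail that a delta batch would carry.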

The Writer and Reader extensions would be pretty simple, since the 
DictionaryBuilder API remains basically the same. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
