[
https://issues.apache.org/jira/browse/ARROW-2176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dimitri Vorona updated ARROW-2176:
----------------------------------
External issue URL: https://github.com/apache/arrow/pull/1629
> [C++] Extend DictionaryBuilder to support delta dictionaries
> ------------------------------------------------------------
>
> Key: ARROW-2176
> URL: https://issues.apache.org/jira/browse/ARROW-2176
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Dimitri Vorona
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9.0
>
>
> [The IPC format|https://arrow.apache.org/docs/ipc.html] specifies the
> possibility of sending additional dictionary batches with a previously seen
> id and an isDelta flag to extend the existing dictionaries with new entries.
> Right now, the DictionaryBuilder (as well as the IPC writer and reader) does
> not support generation of delta dictionaries.
> This pull request contains a basic implementation of the DictionaryBuilder
> with delta dictionary support. The API usage can be seen in the dictionary
> tests (e.g.
> [here|https://github.com/alendit/arrow/blob/delta_dictionary_builder/cpp/src/arrow/array-test.cc#L1773]).
> The basic idea is that the user just reuses the builder object after calling
> Finish(Array*) for the first time. Subsequent calls to Append will create new
> entries only for unseen elements and reuse the ids from previous dictionaries
> for the seen ones.
> Some considerations:
> # The API is pretty implicit; an additional flag for Finish, which
> explicitly indicates a desire to use the builder for delta dictionary
> generation, might be expedient from the error avoidance point of view.
> # Right now the implementation uses an additional "overflow dictionary" to
> store the seen items. This adds a copy on each Finish call and an additional
> lookup at each GetItem or Append call. I assume we might get away with
> returning Array slices at Finish, which would remove the need for an
> additional overflow dictionary. If the gist of the PR is approved, I can look
> into further optimizations.
> The Writer and Reader extensions would be pretty simple, since the
> DictionaryBuilder API remains basically the same.
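The reuse-after-Finish idea described above can be illustrated with a small standalone sketch. This is not the Arrow DictionaryBuilder and does not use any Arrow API: ToyDictionaryBatch, ToyDeltaDictionaryBuilder, ApplyDictionaryBatch and ExampleUsage are hypothetical names invented for illustration, and values are assumed to be strings. It only shows the bookkeeping: the first Finish returns a full dictionary, later Finish calls return only the entries unseen so far while indices keep referring to ids in the accumulated dictionary, and a reader extends its dictionary when the delta flag is set.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical toy type (not part of Arrow): the result of one Finish() call.
struct ToyDictionaryBatch {
  std::vector<std::string> dictionary;  // entries that are new since the last Finish()
  std::vector<int32_t> indices;         // ids into the accumulated dictionary
  bool is_delta;                        // false for the first batch, true afterwards
};

// Hypothetical toy builder (not part of Arrow) that is reused after Finish().
class ToyDeltaDictionaryBuilder {
 public:
  void Append(const std::string& value) {
    auto it = ids_.find(value);
    if (it == ids_.end()) {
      // Unseen value: assign the next global id and remember it as pending.
      const int32_t id = static_cast<int32_t>(ids_.size());
      it = ids_.emplace(value, id).first;
      pending_.push_back(value);
    }
    // Seen values reuse the id assigned in an earlier batch.
    indices_.push_back(it->second);
  }

  ToyDictionaryBatch Finish() {
    ToyDictionaryBatch out{std::move(pending_), std::move(indices_), finished_once_};
    pending_.clear();   // leave moved-from vectors in a known empty state
    indices_.clear();
    finished_once_ = true;
    return out;
  }

 private:
  std::unordered_map<std::string, int32_t> ids_;  // value -> global id, kept across Finish()
  std::vector<std::string> pending_;              // values first seen in the current batch
  std::vector<int32_t> indices_;                  // indices of the current batch
  bool finished_once_ = false;
};

// Reader side of the sketch: a delta batch extends the existing dictionary,
// a non-delta batch replaces it.
void ApplyDictionaryBatch(const ToyDictionaryBatch& batch,
                          std::vector<std::string>* dictionary) {
  if (!batch.is_delta) dictionary->clear();
  dictionary->insert(dictionary->end(), batch.dictionary.begin(),
                     batch.dictionary.end());
}

// Walks through two batches built with the same builder object.
void ExampleUsage() {
  ToyDeltaDictionaryBuilder builder;
  builder.Append("a");
  builder.Append("b");
  builder.Append("a");
  ToyDictionaryBatch first = builder.Finish();   // dictionary: a, b; not a delta

  builder.Append("b");                           // seen before: reuses id 1
  builder.Append("c");                           // unseen: gets id 2
  ToyDictionaryBatch second = builder.Finish();  // dictionary: c only; delta

  std::vector<std::string> reader_dictionary;
  ApplyDictionaryBatch(first, &reader_dictionary);
  ApplyDictionaryBatch(second, &reader_dictionary);
  assert(reader_dictionary.size() == 3);
  assert(reader_dictionary[second.indices[1]] == "c");
}
```

In this sketch a single hash map plays the role of both the current and the previously seen entries, so there is no separate "overflow dictionary"; whether that maps onto the real builder's memo table is an open question left to the PR.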
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)