[ 
https://issues.apache.org/jira/browse/ARROW-2176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn reassigned ARROW-2176:
----------------------------------

    Assignee: Dimitri Vorona

> [C++] Extend DictionaryBuilder to support delta dictionaries
> ------------------------------------------------------------
>
>                 Key: ARROW-2176
>                 URL: https://issues.apache.org/jira/browse/ARROW-2176
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Dimitri Vorona
>            Assignee: Dimitri Vorona
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.9.0
>
>
> [The IPC format|https://arrow.apache.org/docs/ipc.html] specifies the 
> possibility of sending additional dictionary batches with a previously seen 
> id and an isDelta flag to extend the existing dictionaries with new entries. 
> Right now, the DictionaryBuilder (as well as the IPC writer and reader) does 
> not support generation of delta dictionaries.
> This pull request contains a basic implementation of the DictionaryBuilder 
> with delta dictionary support. The API usage can be seen in the dictionary 
> tests (e.g. 
> [here|https://github.com/alendit/arrow/blob/delta_dictionary_builder/cpp/src/arrow/array-test.cc#L1773]).
> The basic idea is that the user simply reuses the builder object after calling 
> Finish(Array*) for the first time. Subsequent calls to Append create new 
> entries only for unseen elements and reuse the ids from previous dictionaries 
> for the seen ones.
> Some considerations:
>  # The API is pretty implicit; an additional flag for Finish that 
> explicitly indicates the intent to use the builder for delta dictionary 
> generation might be expedient from an error-avoidance point of view.
>  # Right now the implementation uses an additional "overflow dictionary" to 
> store the seen items. This adds a copy on each Finish call and an additional 
> lookup at each GetItem or Append call. I assume we could get away with 
> returning Array slices at Finish, which would remove the need for an 
> additional overflow dictionary. If the gist of the PR is approved, I can look 
> into further optimizations.
> The Writer and Reader extensions would be pretty simple, since the 
> DictionaryBuilder API remains basically the same. 
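
The builder-reuse behavior described above can be illustrated with a small standalone sketch. This is not Arrow's DictionaryBuilder and does not use any Arrow API: ToyDeltaDictionaryBuilder and DictBatch are hypothetical names invented for illustration, values are plain strings, and Finish returns a struct instead of filling an Array. It assumes the "slice" variant mentioned in the second consideration, i.e. a single lookup table plus an offset marking where the not-yet-emitted entries start, rather than a separate overflow dictionary.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical toy type (not an Arrow class): the result of one Finish call.
struct DictBatch {
  std::vector<int32_t> indices;         // dictionary ids of the appended values
  std::vector<std::string> dictionary;  // all entries, or only the new ones if is_delta
  bool is_delta = false;                // true for every batch after the first
};

// Hypothetical toy builder (not Arrow's DictionaryBuilder): it keeps its
// lookup table across Finish calls, so ids stay stable between batches.
class ToyDeltaDictionaryBuilder {
 public:
  // Record one value, assigning a new id only if the value is unseen.
  void Append(const std::string& value) {
    auto it = memo_.find(value);
    if (it == memo_.end()) {
      const int32_t id = static_cast<int32_t>(values_.size());
      it = memo_.emplace(value, id).first;
      values_.push_back(value);
    }
    indices_.push_back(it->second);
  }

  // Emit the indices appended since the last Finish together with only the
  // dictionary entries added since the last Finish (the "delta").
  DictBatch Finish() {
    DictBatch out;
    out.indices = std::move(indices_);
    indices_.clear();  // leave the moved-from vector in a known empty state
    out.dictionary.assign(
        values_.begin() + static_cast<std::ptrdiff_t>(delta_start_),
        values_.end());
    out.is_delta = finished_once_;
    delta_start_ = values_.size();
    finished_once_ = true;
    return out;
  }

 private:
  std::unordered_map<std::string, int32_t> memo_;  // value -> id, never reset
  std::vector<std::string> values_;                // id -> value, append-only
  std::vector<int32_t> indices_;                   // indices of the current batch
  std::size_t delta_start_ = 0;                    // first entry not yet emitted
  bool finished_once_ = false;
};
```

With this sketch, appending "a", "b", "a" and calling Finish yields indices 0, 1, 0 and the full dictionary {"a", "b"}. Appending "b", "c" to the same builder and calling Finish again yields indices 1, 2 and a delta dictionary containing only {"c"}, which a reader would append to the dictionary it already holds for that id.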



