ASF GitHub Bot commented on ARROW-2176:

alendit opened a new pull request #1629: ARROW-2176: [C++] Extend 
DictionaryBuilder to support delta dictionaries
URL: https://github.com/apache/arrow/pull/1629

> [C++] Extend DictionaryBuilder to support delta dictionaries
> ------------------------------------------------------------
>                 Key: ARROW-2176
>                 URL: https://issues.apache.org/jira/browse/ARROW-2176
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Dimitri Vorona
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.9.0
> [The IPC format|https://arrow.apache.org/docs/ipc.html] allows sending 
> additional dictionary batches with a previously seen id and an isDelta flag 
> to extend existing dictionaries with new entries. 
> Right now, the DictionaryBuilder (as well as the IPC writer and reader) does 
> not support generation of delta dictionaries.
> This pull request contains a basic implementation of the DictionaryBuilder 
> with delta dictionary support. The API usage can be seen in the dictionary 
> tests (e.g. 
> [here|https://github.com/alendit/arrow/blob/delta_dictionary_builder/cpp/src/arrow/array-test.cc#L1773]).
>  The basic idea is that the user just reuses the builder object after calling 
> Finish(Array*) for the first time. Subsequent calls to Append will create new 
> entries only for unseen elements and reuse the ids from previous dictionaries 
> for the ones already seen.
> Some considerations:
>  # The API is pretty implicit. An additional flag for Finish that 
> explicitly indicates the intent to use the builder for delta dictionary 
> generation might help avoid errors.
>  # Right now the implementation uses an additional "overflow dictionary" to 
> store the seen items. This adds a copy on each Finish call and an additional 
> lookup on each GetItem or Append call. I assume we might get away with 
> returning Array slices from Finish, which would remove the need for an 
> additional overflow dictionary. If the gist of the PR is approved, I can look 
> into further optimizations.
> The Writer and Reader extensions would be pretty simple, since the 
> DictionaryBuilder API remains basically the same. 
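
The reuse-after-Finish idea described above can be illustrated with a small standalone sketch. This is not the Arrow DictionaryBuilder API and not the code from the pull request: ToyDeltaDictionaryBuilder, its Result struct, and the Example function are hypothetical names made up for illustration, and the sketch uses plain std::string values and a hash map instead of Arrow arrays. It only shows how ids of already seen values can be reused while each Finish call returns just the entries added since the previous call.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical toy builder for illustration only; NOT the Arrow API.
class ToyDeltaDictionaryBuilder {
 public:
  struct Result {
    std::vector<std::string> dictionary;  // entries added since the previous Finish()
    std::vector<int32_t> indices;         // ids into the cumulative dictionary
    bool is_delta;                        // true for every Finish() after the first
  };

  void Append(const std::string& value) {
    auto it = ids_.find(value);
    if (it == ids_.end()) {
      // Unseen value: assign the next id of the cumulative dictionary
      // and remember the value for the next (delta) dictionary.
      const int32_t id = static_cast<int32_t>(ids_.size());
      it = ids_.emplace(value, id).first;
      pending_.push_back(value);
    }
    // Seen values reuse the id assigned in an earlier batch.
    indices_.push_back(it->second);
  }

  Result Finish() {
    Result out{std::move(pending_), std::move(indices_), finished_once_};
    // Reset per-batch state; the value-to-id map is kept across calls.
    pending_.clear();
    indices_.clear();
    finished_once_ = true;
    return out;
  }

 private:
  std::unordered_map<std::string, int32_t> ids_;  // all values seen so far
  std::vector<std::string> pending_;              // values new in this batch
  std::vector<int32_t> indices_;                  // indices of this batch
  bool finished_once_ = false;
};

// Usage: the same builder object is reused after the first Finish().
void Example() {
  ToyDeltaDictionaryBuilder builder;
  builder.Append("a");
  builder.Append("b");
  builder.Append("a");
  const auto first = builder.Finish();  // full dictionary {"a", "b"}
  assert(!first.is_delta);
  assert(first.dictionary.size() == 2);

  builder.Append("b");  // seen before, reuses id 1
  builder.Append("c");  // unseen, gets id 2
  const auto second = builder.Finish();  // delta dictionary {"c"}
  assert(second.is_delta);
  assert(second.dictionary.size() == 1);
  assert(second.indices.size() == 2);
}
```

In this sketch the only state carried across Finish calls is the value-to-id map, so no separate copy of the seen items is made. Whether the same holds for the Arrow implementation depends on how its memo table and the "overflow dictionary" mentioned above are organized.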

This message was sent by Atlassian JIRA
