ASF GitHub Bot commented on ARROW-2330:

alendit commented on issue #1769: ARROW-2330: [C++] Optimize delta buffer 
creation with partially finishable array builders
URL: https://github.com/apache/arrow/pull/1769#issuecomment-380165985
   Hi Uwe,
   I see what you mean. Now that I've looked more carefully into
`BufferSlices`, I see how dangerous they can be. Something like
[this](https://gist.github.com/alendit/f759bc12e03d9dd9d72cac90a6334cc5) would
cause a read-after-free.
   The problem, as I see it, is that the SliceBuffer user has no way to ensure
the slice's validity. So even if we skip this PR, the problems with slices might
still surface in the future. I think a lean solution, like adding a
`shared_ptr<Buffer> parent` to the slice and referencing its data instead, would
improve memory safety and might benefit this PR, too.
   Do you think we should discuss it on the mailing list?
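
The idea of a slice that holds a `shared_ptr` to its parent can be sketched roughly as below. This is only an illustration of the ownership pattern, not Arrow's actual code: `ToyBuffer`, `ToySlice` and `MakeSliceAndDropParent` are hypothetical names made up for the example, and the parent here is a plain `std::vector` wrapper rather than a real `Buffer`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a buffer that owns its bytes (not arrow::Buffer).
class ToyBuffer {
 public:
  explicit ToyBuffer(std::vector<uint8_t> bytes) : storage_(std::move(bytes)) {}
  const uint8_t* data() const { return storage_.data(); }
  size_t size() const { return storage_.size(); }

 private:
  std::vector<uint8_t> storage_;
};

// Hypothetical slice that co-owns its parent: the parent's memory cannot be
// released while any slice referring to it is still alive.
class ToySlice {
 public:
  ToySlice(std::shared_ptr<ToyBuffer> parent, size_t offset, size_t length)
      : parent_(std::move(parent)), offset_(offset), length_(length) {
    assert(offset_ + length_ <= parent_->size());
  }
  // Data is always reached through the parent, never through a cached raw pointer.
  const uint8_t* data() const { return parent_->data() + offset_; }
  size_t size() const { return length_; }

 private:
  std::shared_ptr<ToyBuffer> parent_;
  size_t offset_;
  size_t length_;
};

// The caller drops its own handle to the parent; the slice keeps it alive.
ToySlice MakeSliceAndDropParent() {
  auto parent = std::make_shared<ToyBuffer>(std::vector<uint8_t>{10, 20, 30, 40, 50});
  ToySlice slice(parent, 1, 3);
  parent.reset();  // only the slice's reference remains
  return slice;
}
```

With a raw pointer in place of `parent_`, reading from the returned slice would touch freed memory; with the shared owner, the read stays valid for the slice's whole lifetime.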

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

> [C++] Optimize delta buffer creation with partially finishable array builders
> -----------------------------------------------------------------------------
>                 Key: ARROW-2330
>                 URL: https://issues.apache.org/jira/browse/ARROW-2330
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>    Affects Versions: 0.8.0
>            Reporter: Dimitri Vorona
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.10.0
> The main aim of this change is to optimize the building of delta
> dictionaries. In the current version, delta dictionaries are built using an
> additional "overflow" buffer, which leads to complicated and potentially
> error-prone code and subpar performance by doubling the number of lookups.
> I solve this problem by introducing the notion of partially finishable array
> builders, i.e. builders that are able to retain their state when Finish is
> called. The interface is based on RecordBatchBuilder::Flush, i.e. Finish is
> overloaded with an additional signature, Finish(bool reset_builder,
> std::shared_ptr<Array>* out). The resulting Arrays point to the same data
> buffer with different offsets.
> I'm aware that the change is fairly big, but I'd like to discuss it
> here. The solution makes the code more straightforward, doesn't bloat the
> code base too much, and leaves the API more or less untouched. Additionally,
> the new way to make delta dictionaries by using a different call signature to
> Finish feels cleaner to me.
> I'm looking forward to your criticism and improvement ideas.
> The pull request is available at: https://github.com/apache/arrow/pull/1769
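
To make the description above more concrete, here is a minimal sketch of what a partially finishable builder could look like. It is an assumption-laden toy, not the code from the pull request: `ToyBuilder` and `ToyArray` are hypothetical names, the result is returned by value instead of through a `std::shared_ptr<Array>* out` parameter, and the exact meaning of `reset_builder` (here: `false` keeps the accumulated state, `true` discards it) is guessed from the issue text.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical array view: shared storage plus an offset and a length.
struct ToyArray {
  std::shared_ptr<const std::vector<int64_t>> data;
  size_t offset;
  size_t length;

  int64_t Value(size_t i) const {
    assert(i < length);
    return (*data)[offset + i];
  }
};

// Hypothetical builder that can be finished without losing its state.
class ToyBuilder {
 public:
  void Append(int64_t v) { values_->push_back(v); }

  // Returns the values appended since the previous Finish call.
  // reset_builder == false: keep the storage, so later results share it and
  //                         differ only in offset and length (the "delta").
  // reset_builder == true:  start over with fresh, empty storage afterwards.
  ToyArray Finish(bool reset_builder) {
    ToyArray out{values_, finished_, values_->size() - finished_};
    if (reset_builder) {
      values_ = std::make_shared<std::vector<int64_t>>();
      finished_ = 0;
    } else {
      finished_ = values_->size();
    }
    return out;
  }

 private:
  std::shared_ptr<std::vector<int64_t>> values_ =
      std::make_shared<std::vector<int64_t>>();
  size_t finished_ = 0;  // number of values already handed out
};
```

Calling `Finish(false)` twice with appends in between yields two views over the same storage, the second starting where the first ended. The views index through the shared vector rather than caching a raw pointer, so later appends that reallocate the storage do not invalidate earlier results in this toy version.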

This message was sent by Atlassian JIRA
