[ 
https://issues.apache.org/jira/browse/ARROW-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719148#comment-16719148
 ] 

Wes McKinney commented on ARROW-2532:
-------------------------------------

Let's take a closer look after I get my patches up for those issues (as I've 
taken the butcher's blade to builder.h/builder*.cc).

I think that introducing chunking into the base ArrayBuilder would introduce 
too much API complexity for simple use cases -- so keeping the builders that 
produce single arrays as minimal as possible is a good thing. I 
would say we should define the public API that we desire for chunked builders 
and work backwards to the most efficient and easiest-to-maintain (in terms of 
code duplication or lack thereof) implementation.

I was surprised to see that the chunked binary builder I implemented 
outperformed the non-chunked benchmark on identical data (unless I made a 
mistake in the implementation). This suggests that we might want to use 
chunking more liberally in data ingest code (CSV, Parquet, etc.) -- e.g. 
capping array chunks at 8-16MB or so.
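
To make the idea of capping chunks concrete, here is a minimal sketch of a 
size-capped chunked binary builder. It is not the Arrow API and not the 
implementation benchmarked above: BinaryChunk, ChunkedBinaryBuilder, and the 
max_chunk_bytes parameter are hypothetical names, and plain standard-library 
containers stand in for Arrow buffers. It only illustrates starting a new 
chunk once the current one would exceed the cap, which also keeps each 
chunk's offsets within 32 bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a binary array: values are concatenated in
// `data`, and 32-bit offsets delimit them (offsets.size() == length() + 1).
struct BinaryChunk {
  std::vector<int32_t> offsets{0};
  std::string data;

  std::size_t length() const { return offsets.size() - 1; }

  std::string value(std::size_t i) const {
    const std::size_t begin = static_cast<std::size_t>(offsets[i]);
    const std::size_t end = static_cast<std::size_t>(offsets[i + 1]);
    return data.substr(begin, end - begin);
  }
};

// Hypothetical chunked builder (an assumption, not an Arrow class): it starts
// a new chunk whenever appending would push the current chunk's data past
// max_chunk_bytes. A value larger than the cap gets a chunk of its own.
class ChunkedBinaryBuilder {
 public:
  explicit ChunkedBinaryBuilder(std::size_t max_chunk_bytes)
      : max_chunk_bytes_(max_chunk_bytes) {
    // The cap must itself be representable as a 32-bit offset.
    assert(max_chunk_bytes_ <= kMaxOffset);
  }

  void Append(const std::string& value) {
    // A single value must still fit within 32-bit offsets.
    assert(value.size() <= kMaxOffset);
    if (current_.length() > 0 &&
        current_.data.size() + value.size() > max_chunk_bytes_) {
      Flush();
    }
    current_.data.append(value);
    current_.offsets.push_back(static_cast<int32_t>(current_.data.size()));
  }

  // Returns all chunks built so far and resets the builder.
  std::vector<BinaryChunk> Finish() {
    if (current_.length() > 0) Flush();
    std::vector<BinaryChunk> out;
    out.swap(chunks_);
    return out;
  }

 private:
  static constexpr std::size_t kMaxOffset =
      static_cast<std::size_t>(std::numeric_limits<int32_t>::max());

  void Flush() {
    chunks_.push_back(std::move(current_));
    current_ = BinaryChunk{};
  }

  std::size_t max_chunk_bytes_;
  BinaryChunk current_;
  std::vector<BinaryChunk> chunks_;
};
```

With a cap in the range suggested above (for example 16 * 1024 * 1024 bytes), 
ingest code would call Append per value and Finish once at the end, treating 
the returned vector as the chunks of one logical column.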

> [C++] Add chunked builder classes
> ---------------------------------
>
>                 Key: ARROW-2532
>                 URL: https://issues.apache.org/jira/browse/ARROW-2532
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 0.9.0
>            Reporter: Antoine Pitrou
>            Priority: Major
>
> I think it would be useful to have chunked builders for list, string and 
> binary types. A chunked builder would produce a chunked array as output, 
> circumventing the 32-bit offset limit of those types. There's some 
> special-casing scattered around our NumPy conversion routines right now.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
