[ https://issues.apache.org/jira/browse/ARROW-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719148#comment-16719148 ]
Wes McKinney commented on ARROW-2532:
-------------------------------------
Let's take a closer look after I get my patches up for those issues (as I've
taken the butcher's blade to builder.h/builder*.cc).
I think that introducing chunking into the base ArrayBuilder would introduce
too much API complexity for simple use cases -- so keeping the builders that
produce single arrays as minimal as possible is a good thing. I would say we
should define the public API that we desire for chunked builders
and work backwards to the most efficient and easiest-to-maintain (in terms of
code duplication or lack thereof) implementation.
I was surprised to see that the chunked binary builder I implemented
outperformed the non-chunked benchmark on identical data (unless I made a
mistake in the implementation). This suggests that we might want to use
chunking more liberally in data ingest code (CSV, Parquet, etc.) -- e.g.
capping array chunks at 8-16MB or so.
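
To make the chunk-capping idea concrete, here is a minimal standalone sketch
of a chunked binary builder that starts a new chunk once a byte cap would be
exceeded. It deliberately does not use the Arrow headers: BinaryChunk,
ChunkedBinaryBuilderSketch and max_chunk_bytes are hypothetical names chosen
for illustration, and the real builder API may look quite different.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Stand-in for a single binary array: concatenated bytes plus 32-bit offsets.
// Hypothetical type for illustration only, not Arrow's BinaryArray.
struct BinaryChunk {
  std::string data;
  std::vector<int32_t> offsets{0};  // offsets.size() == length() + 1

  int64_t length() const { return static_cast<int64_t>(offsets.size()) - 1; }

  std::string_view Get(int64_t i) const {
    return std::string_view(data).substr(offsets[i], offsets[i + 1] - offsets[i]);
  }
};

// Hypothetical chunked builder: closes the current chunk whenever appending
// a value would push its data size past max_chunk_bytes.
class ChunkedBinaryBuilderSketch {
 public:
  explicit ChunkedBinaryBuilderSketch(int32_t max_chunk_bytes)
      : max_chunk_bytes_(static_cast<std::size_t>(max_chunk_bytes)) {
    assert(max_chunk_bytes > 0);
  }

  void Append(std::string_view value) {
    // Assumption: a single value is never split across chunks.
    assert(value.size() <= max_chunk_bytes_);
    if (current_.data.size() + value.size() > max_chunk_bytes_) {
      Flush();
    }
    current_.data.append(value);
    current_.offsets.push_back(static_cast<int32_t>(current_.data.size()));
  }

  // Returns all chunks built so far and resets the builder.
  std::vector<BinaryChunk> Finish() {
    Flush();
    return std::exchange(chunks_, {});
  }

 private:
  void Flush() {
    if (current_.length() == 0) return;  // nothing to emit
    chunks_.push_back(std::move(current_));
    current_ = BinaryChunk{};
  }

  std::size_t max_chunk_bytes_;
  BinaryChunk current_;
  std::vector<BinaryChunk> chunks_;
};
```

Because the cap is an int32_t, each chunk's offsets always fit in 32 bits,
which is the property the issue description asks for. In ingest code the cap
would presumably be something like 8-16MB rather than the tiny values one
would use to exercise the sketch.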
> [C++] Add chunked builder classes
> ---------------------------------
>
> Key: ARROW-2532
> URL: https://issues.apache.org/jira/browse/ARROW-2532
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 0.9.0
> Reporter: Antoine Pitrou
> Priority: Major
>
> I think it would be useful to have chunked builders for list, string and
> binary types. A chunked builder would produce a chunked array as output,
> circumventing the 32-bit offset limit of those types. There's some
> special-casing scattered around our NumPy conversion routines right now.