[ https://issues.apache.org/jira/browse/ARROW-11878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Raúl Cumplido updated ARROW-11878:
----------------------------------
Fix Version/s: 9.0.0
(was: 8.0.0)
> [C++] Improve Converter API to support chunking
> -----------------------------------------------
>
> Key: ARROW-11878
> URL: https://issues.apache.org/jira/browse/ARROW-11878
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Neal Richardson
> Priority: Major
> Fix For: 9.0.0
>
>
> We would like to be able to chunk a data frame when converting it to an Arrow
> Table in R (see ARROW-9293). Apparently this is not supported in pyarrow either.
> [~romainfrancois] says two things need to happen:
> - The Converter API needs to be able to Extend() a range of values. The
> current API is the {{Status Extend(SEXP x, int64_t size)}} override, which
> says "ingest the vector x, which has this many elements".
> - A Chunker, or perhaps a new class, would sit on top of that, so that
> {{Chunker::Extend(x)}} would call {{Converter$Extend(x, start, size)}}
> multiple times (once for each chunk).
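
The two steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Arrow implementation: `ToyConverter` and `ChunkedExtend` are hypothetical names, a `std::vector<int>` stands in for both the R vector and the resulting Arrow arrays, and error handling via `Status` is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a converter: it copies ints into the chunk
// currently being built. None of these names are Arrow's real API.
class ToyConverter {
 public:
  // Range-based Extend: ingest only values[start, start + size).
  void Extend(const std::vector<int>& values, int64_t start, int64_t size) {
    assert(start >= 0 && size >= 0);
    assert(start + size <= static_cast<int64_t>(values.size()));
    current_.insert(current_.end(), values.begin() + start,
                    values.begin() + start + size);
  }

  // Hand back the chunk built so far and start a fresh one.
  std::vector<int> Finish() {
    std::vector<int> out;
    out.swap(current_);
    return out;
  }

 private:
  std::vector<int> current_;
};

// Hypothetical chunker layered on top: one range-based Extend per chunk.
std::vector<std::vector<int>> ChunkedExtend(const std::vector<int>& values,
                                            int64_t chunk_size) {
  assert(chunk_size > 0);
  ToyConverter converter;
  std::vector<std::vector<int>> chunks;
  const int64_t total = static_cast<int64_t>(values.size());
  for (int64_t start = 0; start < total; start += chunk_size) {
    // The last chunk may be shorter than chunk_size.
    const int64_t size = std::min(chunk_size, total - start);
    converter.Extend(values, start, size);
    chunks.push_back(converter.Finish());
  }
  return chunks;
}
```

With this shape the caller picks the chunk size up front, and the converter never needs to see the whole vector at once.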
> I believe the current chunker solves a different problem and is rooted in a
> Converter that deals with elements one by one, so that:
> - if the element can be Append()ed, that's fine
> - if not, then create a new chunk and try again
> The current chunker has a multiple-element method, but it is all or nothing:
> {code}
> // we could get bit smarter here since the whole batch of appendable values
> // will be rejected if a capacity error is raised
> Status Extend(InputType values, int64_t size) {
>   auto status = converter_->Extend(values, size);
>   if (ARROW_PREDICT_FALSE(status.IsCapacityError())) {
>     if (converter_->builder()->length() == 0) {
>       return status;
>     }
>     ARROW_RETURN_NOT_OK(FinishChunk());
>     return Extend(values, size);
>   }
>   length_ += size;
>   return status;
> }
> {code}
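
To make the all-or-nothing behaviour concrete, here is a toy model of the retry logic quoted above. It is a sketch, not Arrow code: `ToyBuilder` and `ToyChunker` are hypothetical, batches are represented only by their element counts, and it assumes (as the quoted comment suggests) that a capacity error means nothing from the batch was appended.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical capacity-limited builder: a batch is accepted whole or not at all.
struct ToyBuilder {
  int64_t capacity;
  int64_t length;

  explicit ToyBuilder(int64_t cap) : capacity(cap), length(0) {
    assert(cap > 0);
  }

  // Returns false to model a capacity error; nothing is appended in that case.
  bool TryExtend(int64_t size) {
    if (length + size > capacity) return false;
    length += size;
    return true;
  }
};

// Mirrors the control flow of the quoted Extend(): on a capacity error,
// finish the current chunk and retry the entire batch in a fresh one.
struct ToyChunker {
  ToyBuilder builder;
  std::vector<int64_t> finished;  // lengths of the chunks completed so far

  explicit ToyChunker(int64_t capacity) : builder(capacity) {}

  bool Extend(int64_t size) {
    if (builder.TryExtend(size)) return true;
    // Even an empty chunk cannot hold the batch: give up.
    if (builder.length == 0) return false;
    finished.push_back(builder.length);  // plays the role of FinishChunk()
    builder.length = 0;
    return Extend(size);
  }
};
```

In this model, with a capacity of 10, two batches of 6 end up in separate chunks, while a single batch of 11 fails outright instead of being split across chunks.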
> This does not give a way to say, e.g., "take this vector and chunk it into
> arrays of this size", which is what we want.
> cc [~kszucs] [~bkietz]