[
https://issues.apache.org/jira/browse/ARROW-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16251621#comment-16251621
]
Wes McKinney commented on ARROW-1808:
-------------------------------------
[~kou] I am going to start looking at this soon, since it may cause a little
bit of disruption in the glib bindings. This is a moderately disruptive API
change, but in the long term it will be for the best. The current
{{arrow::RecordBatch}} is a "simple in-memory record batch", but the
object boxing required to produce a vector of
{{std::shared_ptr<arrow::ArrayData>}} can be quite expensive for large record
batches.
Instead, we could make {{arrow::RecordBatch}} an abstract interface with
virtual functions for column access, with the current incarnation of RecordBatch
as a subclass. We could then also create an {{arrow::IpcRecordBatch}} that does
late materialization of the {{arrow::Array}} objects. If you have 1000
columns, you would not pay the cost of creating array objects for all of them if
you only end up accessing a few columns in some analytics algorithm.
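
A rough sketch of the shape this could take is below. It is only an
illustration of the idea, not the actual Arrow API: {{Array}} is a toy
stand-in for {{arrow::Array}}, and {{SimpleRecordBatch}},
{{LazyRecordBatch}}, the {{Loader}} callback and {{materialized()}} are
hypothetical names made up for this example. The lazy variant is also not
thread-safe as written.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Toy stand-in for arrow::Array (assumption: the real class is far richer).
struct Array {
  std::vector<int> values;
};

// Abstract interface: column access goes through a virtual function.
class RecordBatch {
 public:
  virtual ~RecordBatch() = default;
  virtual int num_columns() const = 0;
  virtual std::shared_ptr<Array> column(int i) const = 0;
};

// The "simple in-memory" flavor: every column is boxed up front.
class SimpleRecordBatch : public RecordBatch {
 public:
  explicit SimpleRecordBatch(std::vector<std::shared_ptr<Array>> columns)
      : columns_(std::move(columns)) {}

  int num_columns() const override {
    return static_cast<int>(columns_.size());
  }
  std::shared_ptr<Array> column(int i) const override {
    return columns_.at(i);
  }

 private:
  std::vector<std::shared_ptr<Array>> columns_;
};

// Hypothetical lazy flavor: a column is built on first access and cached.
class LazyRecordBatch : public RecordBatch {
 public:
  // Callback that builds column i on demand (e.g. from an IPC buffer).
  using Loader = std::function<std::shared_ptr<Array>(int)>;

  LazyRecordBatch(int num_columns, Loader loader)
      : cache_(num_columns), loader_(std::move(loader)) {}

  int num_columns() const override {
    return static_cast<int>(cache_.size());
  }
  std::shared_ptr<Array> column(int i) const override {
    auto& slot = cache_.at(i);  // throws std::out_of_range on a bad index
    if (!slot) {
      slot = loader_(i);  // pay the boxing cost only for columns we touch
      ++materialized_;
    }
    return slot;
  }

  // Number of columns materialized so far (for illustration only).
  int materialized() const { return materialized_; }

 private:
  mutable std::vector<std::shared_ptr<Array>> cache_;
  Loader loader_;
  mutable int materialized_ = 0;
};
```

With something like this, code written against {{RecordBatch}} does not
need to know which flavor it has, and a 1000-column lazy batch that is only
asked for a few columns creates only those few array objects.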
> [C++] Make RecordBatch interface virtual to permit record batches that
> lazy-materialize columns
> -----------------------------------------------------------------------------------------------
>
> Key: ARROW-1808
> URL: https://issues.apache.org/jira/browse/ARROW-1808
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Wes McKinney
> Fix For: 0.8.0
>
>
> This should be looked at soon to prevent having to define a different virtual
> interface for record batches. The record batch constructor is used directly
> in several places in our code and in some third-party code (like MapD), so
> this might be good to get done for 0.8.0.