[ 
https://issues.apache.org/jira/browse/ARROW-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16251621#comment-16251621
 ] 

Wes McKinney commented on ARROW-1808:
-------------------------------------

[~kou] I am going to start looking at this soon, since it may cause some 
disruption in the glib bindings. It is a moderately disruptive API change, but 
it will be for the best in the long term. The current {{arrow::RecordBatch}} is 
a "simple in-memory record batch", and the object boxing required to produce a 
vector of {{std::shared_ptr<arrow::ArrayData>}} can be quite expensive for 
large record batches. 

Instead, we could make {{arrow::RecordBatch}} an abstract interface with 
virtual functions for column access, with the current incarnation of 
RecordBatch as a subclass. We could then also create an 
{{arrow::IpcRecordBatch}} that does late materialization of the 
{{arrow::Array}} objects. If you have 1000 columns, you would not pay the cost 
of creating array objects for all of them when you only end up accessing a few 
columns in some analytics algorithm.
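
A minimal sketch of that shape, to make the idea concrete. It is not the Arrow 
API: everything lives in a hypothetical {{sketch}} namespace, {{Array}} is a 
stand-in struct, and {{SimpleRecordBatch}}, {{LazyRecordBatch}}, {{Loader}} and 
{{materialized()}} are names invented for illustration. A real 
{{arrow::IpcRecordBatch}} would presumably build its arrays from IPC message 
buffers rather than from a callback.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

namespace sketch {  // hypothetical namespace, not part of Arrow

// Stand-in for arrow::Array; the real class wraps typed buffers.
struct Array {
  explicit Array(std::vector<int> v) : values(std::move(v)) {}
  std::vector<int> values;
};

// Abstract interface: column access is virtual, so each subclass decides
// when the Array object for a column is actually created.
class RecordBatch {
 public:
  virtual ~RecordBatch() = default;
  virtual int num_columns() const = 0;
  virtual std::shared_ptr<Array> column(int i) const = 0;
};

// Eager variant, analogous to the current "simple in-memory record batch":
// every column object exists up front.
class SimpleRecordBatch : public RecordBatch {
 public:
  explicit SimpleRecordBatch(std::vector<std::shared_ptr<Array>> columns)
      : columns_(std::move(columns)) {}

  int num_columns() const override {
    return static_cast<int>(columns_.size());
  }
  std::shared_ptr<Array> column(int i) const override {
    return columns_.at(static_cast<std::size_t>(i));
  }

 private:
  std::vector<std::shared_ptr<Array>> columns_;
};

// Lazy variant: builds each Array on first access and caches the result.
// Assumption: single-threaded use; the cache is not synchronized.
class LazyRecordBatch : public RecordBatch {
 public:
  using Loader = std::function<std::shared_ptr<Array>(int)>;

  LazyRecordBatch(int num_columns, Loader loader)
      : loader_(std::move(loader)),
        cache_(static_cast<std::size_t>(num_columns)) {}

  int num_columns() const override {
    return static_cast<int>(cache_.size());
  }
  std::shared_ptr<Array> column(int i) const override {
    auto& slot = cache_.at(static_cast<std::size_t>(i));
    if (!slot) {
      slot = loader_(i);  // materialize only the requested column
      ++materialized_;
    }
    return slot;
  }

  // Number of columns materialized so far (illustration only).
  int materialized() const { return materialized_; }

 private:
  Loader loader_;
  mutable std::vector<std::shared_ptr<Array>> cache_;
  mutable int materialized_ = 0;
};

}  // namespace sketch
```

With this layout, code written against {{sketch::RecordBatch}} works unchanged 
with either subclass, and a 1000-column {{LazyRecordBatch}} whose caller reads 
three columns invokes the loader three times. The cost is a virtual call per 
column access, plus the need for callers to stop using the record batch 
constructor directly.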

> [C++] Make RecordBatch interface virtual to permit record batches that 
> lazy-materialize columns
> -----------------------------------------------------------------------------------------------
>
>                 Key: ARROW-1808
>                 URL: https://issues.apache.org/jira/browse/ARROW-1808
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Wes McKinney
>             Fix For: 0.8.0
>
>
> This should be looked at soon to prevent having to define a different virtual 
> interface for record batches. There are places where we are using the record 
> batch constructor directly, including in some third-party code (like MapD), 
> so this might be good to get done for 0.8.0.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
