[ 
https://issues.apache.org/jira/browse/ARROW-7048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969016#comment-16969016
 ] 

Liya Fan commented on ARROW-7048:
---------------------------------

[~emkornfi...@gmail.com] Agreed. Adding a constant to each offset is more 
efficient. 
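
As a rough illustration of the offset idea, here is a minimal sketch that uses plain int arrays instead of Arrow's actual offset buffers. The class and method names (OffsetConcat, concatOffsets) are hypothetical and not part of the Arrow Java API. It assumes each batch carries n + 1 offsets for n variable-width values, so the offsets of a later batch only need to be shifted by the total data length of the batches before it.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Arrow: merges per-batch offset arrays
// of variable-width vectors into one offset array.
final class OffsetConcat {

    // Each input array is assumed to hold n + 1 non-decreasing offsets
    // for n values (so it has at least one entry).
    static int[] concatOffsets(int[]... batches) {
        int totalValues = 0;
        for (int[] offsets : batches) {
            totalValues += offsets.length - 1;
        }

        int[] merged = new int[totalValues + 1]; // merged[0] stays 0
        int out = 1;
        int shift = 0; // data bytes contributed by earlier batches

        for (int[] offsets : batches) {
            int base = offsets[0]; // normally 0, but a sliced batch may differ
            for (int i = 1; i < offsets.length; i++) {
                // Add one constant per batch to every offset in that batch.
                merged[out++] = offsets[i] - base + shift;
            }
            shift += offsets[offsets.length - 1] - base;
        }
        return merged;
    }

    public static void main(String[] args) {
        int[] first = {0, 3, 5};  // e.g. "abc", "de"
        int[] second = {0, 2, 6}; // e.g. "fg", "hijk"
        System.out.println(Arrays.toString(concatOffsets(first, second)));
        // prints [0, 3, 5, 7, 11]
    }
}
```

The data bytes themselves could then be appended with one bulk copy per batch, so only the offsets need per-element arithmetic.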

> [Java] Support for combining multiple vectors under VectorSchemaRoot
> --------------------------------------------------------------------
>
>                 Key: ARROW-7048
>                 URL: https://issues.apache.org/jira/browse/ARROW-7048
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Java
>            Reporter: Yogesh Tewari
>            Assignee: Liya Fan
>            Priority: Major
>
> Hi,
>  
> pyarrow.Table.combine_chunks provides a nice way to combine multiple record 
> batches under a single pyarrow.Table.
>  
> I am currently working on a downstream application that reads data from 
> BigQuery. The BigQuery Storage API supports data output in Arrow format but 
> streams the data in many batches of 1024 rows or fewer.
> It would be really nice to have the Arrow Java API provide this functionality 
> under an abstraction like VectorSchemaRoot.
> After getting guidance from [~emkornfi...@gmail.com], I tried to write my own 
> implementation by copying data vector by vector using TransferPair's 
> copyValueSafe.
> But, unless I am missing something obvious, it turns out that it only copies 
> one value at a time. That means a lot of looping, calling copyValueSafe for 
> millions of rows from a source vector index to a target vector index. Ideally 
> I would want to concatenate/link the underlying buffers rather than copy one 
> cell at a time.
>  
> For example, if I have:
> {code:java}
> List<VectorSchemaRoot> batchList = new ArrayList<>();
> try (ArrowStreamReader reader = new ArrowStreamReader(
>         new ByteArrayInputStream(out.toByteArray()), allocator)) {
>     Schema schema = reader.getVectorSchemaRoot().getSchema();
>     for (int i = 0; i < 5; i++) {
>         // This will be loaded with new values on every call to loadNextBatch
>         VectorSchemaRoot readBatch = reader.getVectorSchemaRoot();
>         reader.loadNextBatch();
>         batchList.add(readBatch);
>     }
> }
> // VectorSchemaRoot.combineChunks(batchList, newVectorSchemaRoot);
> {code}
>  
> Could there be a method like 
> VectorSchemaRoot.combineChunks(List<VectorSchemaRoot>)?
> I did read the VectorSchemaRoot discussion on 
> https://issues.apache.org/jira/browse/ARROW-6896 and am not sure if it's the 
> right thing to use here.
>  
>  
> PS. Feel free to update the title of this feature request with more 
> appropriate wording.
>  
> Cheers,
> Yogesh
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
