[ 
https://issues.apache.org/jira/browse/ARROW-7048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968998#comment-16968998
 ] 

Micah Kornfield commented on ARROW-7048:
----------------------------------------

"For VariableWidthVectors, we need to transform the offset buffer to a delta 
buffer, do the copy, and then transform the delta buffer back to a partial sum 
buffer. This may involve another feature discussed in ARROW-6394."

I don't think a transformation between the two is necessary. Wouldn't you 
simply need to add a constant to each set of offsets? Translating back and 
forth is more costly than necessary.
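
As a minimal sketch of the constant-shift idea, the snippet below uses plain 
int arrays in place of Arrow offset buffers. The class and method names are 
hypothetical and are not part of the Arrow Java API; it assumes each offset 
array has valueCount + 1 entries, as in the Arrow variable-width layout.

```java
import java.util.Arrays;

/** Hypothetical helper; plain int arrays stand in for Arrow offset buffers. */
public class OffsetConcat {

    /**
     * Concatenates two offset arrays by shifting the second one by a constant.
     * Each input has (valueCount + 1) entries; no delta buffer is built.
     */
    public static int[] concatOffsets(int[] first, int[] second) {
        int firstCount = first.length - 1;
        int secondCount = second.length - 1;
        // Total number of data bytes referenced by the first vector.
        int shift = first[firstCount];
        // Copy the first offsets as-is and leave room for the second set.
        int[] out = Arrays.copyOf(first, firstCount + secondCount + 1);
        for (int i = 1; i <= secondCount; i++) {
            // Rebase on second[0] in case the second buffer does not start at 0.
            out[firstCount + i] = second[i] - second[0] + shift;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] a = {0, 3, 5};   // e.g. "abc", "de"
        int[] b = {0, 1, 5};   // e.g. "f", "ghij"
        System.out.println(Arrays.toString(concatOffsets(a, b)));
        // prints [0, 3, 5, 6, 10]
    }
}
```

The data buffers themselves could then be appended with a single bulk copy 
per vector, so the only per-element work is one addition per offset.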

> [Java] Support for combining multiple vectors under VectorSchemaRoot
> --------------------------------------------------------------------
>
>                 Key: ARROW-7048
>                 URL: https://issues.apache.org/jira/browse/ARROW-7048
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Java
>            Reporter: Yogesh Tewari
>            Assignee: Liya Fan
>            Priority: Major
>
> Hi,
>  
> pyarrow.Table.combine_chunks provides nice functionality for combining 
> multiple record batches into a single pyarrow.Table.
>  
> I am currently working on a downstream application which reads data from 
> BigQuery. The BigQuery Storage API supports data output in Arrow format but 
> streams the data in many batches of 1024 rows or fewer.
> It would be really nice to have Arrow Java api provide this functionality 
> under an abstraction like VectorSchemaRoot.
> After getting guidance from [~emkornfi...@gmail.com], I tried to write my own 
> implementation by copying data vector by vector using TransferPair's 
> copyValueSafe.
> But, unless I am missing something obvious, it turns out it only copies one 
> value at a time. That means a lot of looping, calling copyValueSafe for 
> millions of rows from source vector index to target vector index. Ideally I 
> would want to concatenate/link the underlying buffers rather than copying one 
> cell at a time.
>  
> Eg, if I have :
> {code:java}
> List<VectorSchemaRoot> batchList = new ArrayList<>();
> try (ArrowStreamReader reader = new ArrowStreamReader(
>         new ByteArrayInputStream(out.toByteArray()), allocator)) {
>     Schema schema = reader.getVectorSchemaRoot().getSchema();
>     for (int i = 0; i < 5; i++) {
>         // This will be loaded with new values on every call to loadNextBatch
>         VectorSchemaRoot readBatch = reader.getVectorSchemaRoot();
>         reader.loadNextBatch();
>         batchList.add(readBatch);
>     }
> }
> //VectorSchemaRoot.combineChunks(batchList, newVectorSchemaRoot);{code}
>  
> A method like VectorSchemaRoot.combineChunks(List<VectorSchemaRoot>)?
> I did read the VectorSchemaRoot discussion on 
> https://issues.apache.org/jira/browse/ARROW-6896 and am not sure if it's the 
> right thing to use here.
>  
>  
> PS. Feel free to update the title of this feature request with more 
> appropriate wording.
>  
> Cheers,
> Yogesh
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
