[ https://issues.apache.org/jira/browse/ARROW-7048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968998#comment-16968998 ]
Micah Kornfield commented on ARROW-7048:
----------------------------------------

"For VariableWidthVectors, we need to transform the offset buffer to a delta buffer, do the copy, and then transform the delta buffer back to a partial sum buffer. This may involve another feature discussed in ARROW-6394."

I don't think a transformation between the two is necessary. Wouldn't you simply need to add a constant to each set of offsets (i.e., translating back and forth is more costly than necessary)?

> [Java] Support for combining multiple vectors under VectorSchemaRoot
> --------------------------------------------------------------------
>
>                 Key: ARROW-7048
>                 URL: https://issues.apache.org/jira/browse/ARROW-7048
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Java
>            Reporter: Yogesh Tewari
>            Assignee: Liya Fan
>            Priority: Major
>
> Hi,
>
> pyarrow.Table.combine_chunks provides nice functionality for combining
> multiple record batches under a single pyarrow.Table.
>
> I am currently working on a downstream application which reads data from
> BigQuery. The BigQuery Storage API supports data output in Arrow format but
> streams data in many batches of 1024 or fewer rows.
> It would be really nice to have the Arrow Java API provide this functionality
> under an abstraction like VectorSchemaRoot.
> After getting guidance from [~emkornfi...@gmail.com], I tried to write my own
> implementation by copying data vector by vector using TransferPair's
> copyValueSafe.
> But, unless I am missing something obvious, it turns out it only copies one
> value at a time. That means a lot of looping, calling copyValueSafe for
> millions of rows from source vector index to target vector index. Ideally I
> would want to concatenate/link the underlying buffers rather than copying one
> cell at a time.
>
> E.g., if I have:
> {code:java}
> List<VectorSchemaRoot> batchList = new ArrayList<>();
> try (ArrowStreamReader reader = new ArrowStreamReader(
>     new ByteArrayInputStream(out.toByteArray()), allocator)) {
>   Schema schema = reader.getVectorSchemaRoot().getSchema();
>   for (int i = 0; i < 5; i++) {
>     // This will be loaded with new values on every call to loadNextBatch
>     VectorSchemaRoot readBatch = reader.getVectorSchemaRoot();
>     reader.loadNextBatch();
>     batchList.add(readBatch);
>   }
> }
> //VectorSchemaRoot.combineChunks(batchList, newVectorSchemaRoot);{code}
>
> A method like VectorSchemaRoot.combineChunks(List<VectorSchemaRoot>)?
> I did read the VectorSchemaRoot discussion on
> https://issues.apache.org/jira/browse/ARROW-6896 and am not sure if it's the
> right thing to use here.
>
>
> PS. Feel free to update the title of this feature request with more
> appropriate wording.
>
> Cheers,
> Yogesh
>
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
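
To illustrate the suggestion in the comment above (adding a constant to each set of offsets rather than converting to deltas and back), here is a minimal plain-Java sketch. It is not Arrow code and does not use any Arrow API: the class OffsetConcat and the method concatOffsets are hypothetical names, offset buffers are modeled as int[] arrays, and each chunk's offset buffer is assumed to hold valueCount + 1 entries with the first entry equal to 0.

```java
import java.util.Arrays;

// Hypothetical illustration only; offset buffers are modeled as int[] arrays.
class OffsetConcat {

    /**
     * Concatenates the offset buffers of several variable-width chunks.
     * Assumes every chunk has at least one entry and that its first offset is 0.
     * Each chunk's offsets are shifted by the number of data bytes that
     * precede it in the combined data buffer.
     */
    static int[] concatOffsets(int[][] chunks) {
        int totalValues = 0;
        for (int[] chunk : chunks) {
            totalValues += chunk.length - 1; // valueCount of this chunk
        }

        int[] combined = new int[totalValues + 1]; // combined[0] stays 0
        int pos = 0;  // index of the last offset written so far
        int base = 0; // data bytes contributed by the previous chunks

        for (int[] chunk : chunks) {
            // Skip chunk[0] (always 0): it coincides with the previous end offset.
            for (int i = 1; i < chunk.length; i++) {
                combined[++pos] = chunk[i] + base;
            }
            base += chunk[chunk.length - 1]; // last offset == data length of chunk
        }
        return combined;
    }

    public static void main(String[] args) {
        int[] first = {0, 2, 3};     // e.g. "ab", "c"
        int[] second = {0, 3, 3, 4}; // e.g. "def", "", "g"
        // prints [0, 2, 3, 6, 6, 7]
        System.out.println(Arrays.toString(concatOffsets(new int[][] {first, second})));
    }
}
```

The data buffers themselves could then be copied back to back in bulk, since the shifted offsets already point at the right positions in the combined buffer. Validity buffers would need separate handling, which this sketch leaves out.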