[ https://issues.apache.org/jira/browse/ARROW-7048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Liya Fan reassigned ARROW-7048:
-------------------------------
Assignee: Liya Fan
> [Java] Support for combining multiple vectors under VectorSchemaRoot
> --------------------------------------------------------------------
>
> Key: ARROW-7048
> URL: https://issues.apache.org/jira/browse/ARROW-7048
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Java
> Reporter: Yogesh Tewari
> Assignee: Liya Fan
> Priority: Major
>
> Hi,
>
> pyarrow.Table.combine_chunks provides a convenient way of combining multiple
> record batches under a single pyarrow.Table.
>
> I am currently working on a downstream application that reads data from
> BigQuery. The BigQuery Storage API supports output in Arrow format, but it
> streams the data in many batches of 1024 rows or fewer.
> It would be really nice if the Arrow Java API provided this functionality
> under an abstraction like VectorSchemaRoot.
> After getting guidance from [[email protected]], I tried to write my own
> implementation by copying data vector by vector using TransferPair's
> copyValueSafe.
> But, unless I am missing something obvious, it turns out that it only copies
> one value at a time. That means a lot of looping, calling copyValueSafe for
> millions of rows from a source vector index to a target vector index. Ideally
> I would want to concatenate/link the underlying buffers rather than copy one
> cell at a time.
>
> E.g., if I have:
> {code:java}
> List<VectorSchemaRoot> batchList = new ArrayList<>();
> try (ArrowStreamReader reader = new ArrowStreamReader(
>     new ByteArrayInputStream(out.toByteArray()), allocator)) {
>   Schema schema = reader.getVectorSchemaRoot().getSchema();
>   for (int i = 0; i < 5; i++) {
>     // This will be loaded with new values on every call to loadNextBatch
>     VectorSchemaRoot readBatch = reader.getVectorSchemaRoot();
>     reader.loadNextBatch();
>     batchList.add(readBatch);
>   }
> }
> // VectorSchemaRoot.combineChunks(batchList, newVectorSchemaRoot);
> {code}
>
> Could there be a method like
> VectorSchemaRoot.combineChunks(List<VectorSchemaRoot>)?
> I did read the VectorSchemaRoot discussion on
> https://issues.apache.org/jira/browse/ARROW-6896 and am not sure if it's the
> right thing to use here.
>
>
> PS. Feel free to update the title of this feature request with more
> appropriate wording.
>
> Cheers,
> Yogesh
>
>
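To illustrate the difference the reporter describes between copying one cell at a time and appending whole buffers, here is a minimal sketch in plain Java. It does not use the Arrow Java API: int[] arrays stand in for the data buffers of fixed-width vectors, and the class and method names (BatchConcat, concatPerValue, concatBulk) are hypothetical. It also ignores validity bitmaps and the offset buffers of variable-width vectors, which a real implementation would have to handle.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Arrow: each int[] stands in for the
// data buffer of one fixed-width vector from one batch.
final class BatchConcat {

    // Per-value copy, analogous to calling copyValueSafe once per row.
    static int[] concatPerValue(List<int[]> batches) {
        int[] out = new int[totalLength(batches)];
        int pos = 0;
        for (int[] batch : batches) {
            for (int i = 0; i < batch.length; i++) {
                out[pos++] = batch[i];
            }
        }
        return out;
    }

    // Bulk copy: one System.arraycopy call per batch instead of one
    // assignment per row.
    static int[] concatBulk(List<int[]> batches) {
        int[] out = new int[totalLength(batches)];
        int pos = 0;
        for (int[] batch : batches) {
            System.arraycopy(batch, 0, out, pos, batch.length);
            pos += batch.length;
        }
        return out;
    }

    // Sum of all batch lengths, used to size the target once up front.
    private static int totalLength(List<int[]> batches) {
        int total = 0;
        for (int[] batch : batches) {
            total += batch.length;
        }
        return total;
    }

    public static void main(String[] args) {
        List<int[]> batches = Arrays.asList(
            new int[] {1, 2, 3}, new int[] {4, 5}, new int[] {6});
        // Prints [1, 2, 3, 4, 5, 6]
        System.out.println(Arrays.toString(concatBulk(batches)));
    }
}
```

Both methods produce the same result. The bulk version sizes the target once and issues a single copy per batch, which is roughly the behaviour being requested for the vectors under a VectorSchemaRoot.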