[
https://issues.apache.org/jira/browse/BEAM-8933?focusedWorklogId=606583&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-606583
]
ASF GitHub Bot logged work on BEAM-8933:
----------------------------------------
Author: ASF GitHub Bot
Created on: 04/Jun/21 08:00
Start Date: 04/Jun/21 08:00
Worklog Time Spent: 10m
Work Description: TheNeuralBit commented on a change in pull request
#14586:
URL: https://github.com/apache/beam/pull/14586#discussion_r645147684
##########
File path:
sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIOStorageReadTest.java
##########
@@ -1351,4 +1353,20 @@ public void testReadFromBigQueryIOWithTrimmedSchema()
throws Exception {
p.run();
}
+
+ private static org.apache.arrow.vector.types.pojo.Field field(
Review comment:
Oh shoot, sorry about that, I completely missed that there is a separate
BigQuery `ArrowRecordBatch`. I think your proposed approach is mostly
correct, but rather than parsing the serialized record batch, you'll want to
make a builder and set the appropriate values (e.g. with
[setSerializedRecordBatch](https://googleapis.dev/java/google-cloud-bigquerystorage/1.8.3/com/google/cloud/bigquery/storage/v1/ArrowRecordBatch.Builder.html#setSerializedRecordBatch-com.google.protobuf.ByteString-)):
```java
ArrowRecordBatch bigqueryBatch = ArrowRecordBatch.newBuilder()
    .setRowCount(..)
    .setSerializedRecordBatch(serializedBytes)
    .build();
```
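To show where `serializedBytes` could come from, here's a sketch that serializes an Arrow IPC record batch with `MessageSerializer` and wraps the bytes in the BigQuery proto (the class name `ArrowBatchSketch` and helper `toBigQueryBatch` are illustrative, not part of the PR):

```java
import java.io.ByteArrayOutputStream;
import java.nio.channels.Channels;
import java.util.Arrays;

import com.google.protobuf.ByteString;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.BigIntVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.VectorUnloader;
import org.apache.arrow.vector.ipc.WriteChannel;
import org.apache.arrow.vector.ipc.message.MessageSerializer;

public class ArrowBatchSketch {
  // Serialize an Arrow IPC record batch and wrap the bytes in the
  // BigQuery Storage proto. Helper name is illustrative.
  static com.google.cloud.bigquery.storage.v1.ArrowRecordBatch toBigQueryBatch(
      VectorSchemaRoot root) throws Exception {
    try (org.apache.arrow.vector.ipc.message.ArrowRecordBatch arrowBatch =
        new VectorUnloader(root).getRecordBatch()) {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      // Writes the IPC-framed record batch message into `out`.
      MessageSerializer.serialize(new WriteChannel(Channels.newChannel(out)), arrowBatch);
      return com.google.cloud.bigquery.storage.v1.ArrowRecordBatch.newBuilder()
          .setRowCount(root.getRowCount())
          .setSerializedRecordBatch(ByteString.copyFrom(out.toByteArray()))
          .build();
    }
  }

  public static void main(String[] args) throws Exception {
    try (RootAllocator allocator = new RootAllocator();
        BigIntVector vector = new BigIntVector("x", allocator)) {
      // Build a tiny single-column batch of two rows.
      vector.allocateNew(2);
      vector.set(0, 1L);
      vector.set(1, 2L);
      vector.setValueCount(2);
      VectorSchemaRoot root = new VectorSchemaRoot(
          Arrays.asList(vector.getField()), Arrays.asList(vector), 2);
      com.google.cloud.bigquery.storage.v1.ArrowRecordBatch bigqueryBatch =
          toBigQueryBatch(root);
      System.out.println(bigqueryBatch.getRowCount()); // row count carried over
    }
  }
}
```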
> Is it correct for the Arrow code to use
`org.apache.arrow.vector.ipc.message.ArrowRecordBatch` instead of
`com.google.cloud.bigquery.storage.v1.ArrowRecordBatch`?
Yes, this is correct. The BigQuery `ArrowRecordBatch` is only relevant for
BigQueryIO, while `ArrowConversion` should be more general purpose (there may
be other IOs that produce Arrow data in the future).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 606583)
Time Spent: 47h 10m (was: 47h)
> BigQuery IO should support reading Arrow format over Storage API
> ----------------------------------------------------------------
>
> Key: BEAM-8933
> URL: https://issues.apache.org/jira/browse/BEAM-8933
> Project: Beam
> Issue Type: Improvement
> Components: io-java-gcp
> Reporter: Kirill Kozlov
> Assignee: Miguel Anzo
> Priority: P3
> Time Spent: 47h 10m
> Remaining Estimate: 0h
>
> As of right now BigQuery uses Avro format for reading and writing.
> We should add a config to BigQueryIO to specify which format to use: Arrow or
> Avro (with Avro as default).
--
This message was sent by Atlassian Jira
(v8.3.4#803005)