[https://issues.apache.org/jira/browse/BEAM-8933?focusedWorklogId=606520&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-606520]
ASF GitHub Bot logged work on BEAM-8933:
----------------------------------------
Author: ASF GitHub Bot
Created on: 04/Jun/21 07:53
Start Date: 04/Jun/21 07:53
Worklog Time Spent: 10m
Work Description: MiguelAnzoWizeline commented on a change in pull
request #14586:
URL: https://github.com/apache/beam/pull/14586#discussion_r644373119
##########
File path:
sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIOStorageReadTest.java
##########
@@ -1351,4 +1353,20 @@ public void testReadFromBigQueryIOWithTrimmedSchema() throws Exception {
p.run();
}
+
+ private static org.apache.arrow.vector.types.pojo.Field field(
Review comment:
Hi @TheNeuralBit, I'm still having some problems with this
implementation. I followed the examples you gave me, which helped me
understand the Arrow side better, but while implementing it I found that
`ReadRowsResponse` uses a different `ArrowRecordBatch` implementation,
`com.google.cloud.bigquery.storage.v1.ArrowRecordBatch`, and with that I now
have two questions:
- To convert from
`org.apache.arrow.vector.ipc.message.ArrowRecordBatch` to
`com.google.cloud.bigquery.storage.v1.ArrowRecordBatch`, from what I can see I
would need to write the `org.apache.arrow.vector.ipc.message.ArrowRecordBatch`
to an InputStream and then
[parse](https://googleapis.dev/java/google-cloud-bigquerystorage/1.8.3/com/google/cloud/bigquery/storage/v1/ArrowRecordBatch.html#parseFrom-com.google.protobuf.CodedInputStream-)
[it](https://googleapis.dev/java/google-cloud-bigquerystorage/1.8.3/com/google/cloud/bigquery/storage/v1/ArrowRecordBatch.html#parseFrom-java.io.InputStream-)
into `com.google.cloud.bigquery.storage.v1.ArrowRecordBatch`. Is that correct,
and is there a specific coder needed to do it? I found that the `Avro`
implementation uses a `GenericDatumWriter`. Is there a more straightforward
way to do that conversion? (A sketch of what I mean is below, after the
second question.)
- Is it correct for the Arrow
[code](https://github.com/apache/beam/blob/ed63ed19c67c7c0a1ce7c2888c44fb1238419c32/sdks/java/extensions/arrow/src/main/java/org/apache/beam/sdk/extensions/arrow/ArrowConversion.java)
to use `org.apache.arrow.vector.ipc.message.ArrowRecordBatch` instead of
`com.google.cloud.bigquery.storage.v1.ArrowRecordBatch`?
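
To make the first question concrete, this is roughly the direct
construction I had in mind instead of the `parseFrom` round trip (just a
sketch; `toStorageProto` is a name I made up, and I'm assuming the proto's
`serialized_record_batch` field expects the raw Arrow IPC bytes):

```java
import com.google.protobuf.ByteString;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.channels.Channels;
import org.apache.arrow.vector.ipc.WriteChannel;
import org.apache.arrow.vector.ipc.message.MessageSerializer;

/** Hypothetical helper: wraps an Arrow IPC record batch in the Storage API proto. */
private static com.google.cloud.bigquery.storage.v1.ArrowRecordBatch toStorageProto(
    org.apache.arrow.vector.ipc.message.ArrowRecordBatch batch) throws IOException {
  ByteArrayOutputStream out = new ByteArrayOutputStream();
  // Serialize the batch using the Arrow IPC message format.
  MessageSerializer.serialize(new WriteChannel(Channels.newChannel(out)), batch);
  // Assumption: the proto field carries exactly these serialized IPC bytes,
  // so no protobuf parseFrom round trip is needed.
  return com.google.cloud.bigquery.storage.v1.ArrowRecordBatch.newBuilder()
      .setSerializedRecordBatch(ByteString.copyFrom(out.toByteArray()))
      .build();
}
```

If that assumption holds, the test could then pass the result to
`ReadRowsResponse.newBuilder().setArrowRecordBatch(...)` directly.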
Thanks
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 606520)
Time Spent: 47h (was: 46h 50m)
> BigQuery IO should support reading Arrow format over Storage API
> ----------------------------------------------------------------
>
> Key: BEAM-8933
> URL: https://issues.apache.org/jira/browse/BEAM-8933
> Project: Beam
> Issue Type: Improvement
> Components: io-java-gcp
> Reporter: Kirill Kozlov
> Assignee: Miguel Anzo
> Priority: P3
> Time Spent: 47h
> Remaining Estimate: 0h
>
> As of right now, BigQuery IO uses the Avro format for reading and writing.
> We should add a config to BigQueryIO to specify which format to use: Arrow
> or Avro (with Avro as the default).
--
This message was sent by Atlassian Jira
(v8.3.4#803005)