TheNeuralBit commented on a change in pull request #14586:
URL: https://github.com/apache/beam/pull/14586#discussion_r638152763



##########
File path: sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/SchemaAndRecord.java
##########
@@ -18,23 +18,32 @@
 package org.apache.beam.sdk.io.gcp.bigquery;
 
 import com.google.api.services.bigquery.model.TableSchema;
+import javax.annotation.Nullable;
 import org.apache.avro.generic.GenericRecord;
+import org.apache.beam.sdk.values.Row;
 
 /**
  * A wrapper for a {@link GenericRecord} and the {@link TableSchema} representing the schema of the
  * table (or query) it was generated from.
  */
 public class SchemaAndRecord {
-  private final GenericRecord record;
+  private final Object record;
   private final TableSchema tableSchema;
 
-  public SchemaAndRecord(GenericRecord record, TableSchema tableSchema) {
+  public SchemaAndRecord(Object record, TableSchema tableSchema) {
     this.record = record;
     this.tableSchema = tableSchema;
   }
 
   public GenericRecord getRecord() {
-    return record;
+    if (!(record instanceof GenericRecord)) {
+      throw new IllegalStateException("Object is not GenericRecord");
+    }
+    return (GenericRecord) record;
+  }

Review comment:
   I think the best thing to do long-term would be to find and update all of that code, but I understand that could be substantial scope creep, so there's some risk there: it could take a while, and even then it might surface new issues that we'd need to debug and fix. Let's call this Option (1).
   
   One thing I really don't want to do is keep the status quo from the previous PRs, which makes `SchemaAndRecord` store an `Object` that could be either a `GenericRecord` or a `Row`, and then tests every record at execution time. Testing each element is unnecessary overhead. This could be made slightly better by generating different code at pipeline construction time that handles either a `GenericRecord` or a `Row` (sketched below), but it's still not ideal to store an `Object` in `SchemaAndRecord`. Let's call this Option (2).
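   To make the construction-time idea concrete, here's a rough sketch of what I mean. This is just an illustration, not existing Beam API: `useArrow`, `parseAvro`, and `parseRow` are hypothetical stand-ins for whatever the read path actually knows when the pipeline is built.
   
   ```java
   import org.apache.avro.generic.GenericRecord;
   import org.apache.beam.sdk.transforms.SerializableFunction;
   import org.apache.beam.sdk.values.Row;
   
   class ParseFns {
     // Branch once, at pipeline construction time, instead of
     // instanceof-testing every element at execution time.
     static <T> SerializableFunction<Object, T> chooseParseFn(
         boolean useArrow, // hypothetical: the source format is known up front
         SerializableFunction<GenericRecord, T> parseAvro,
         SerializableFunction<Row, T> parseRow) {
       if (useArrow) {
         // Each element pays a single cast; no per-element type test.
         return record -> parseRow.apply((Row) record);
       }
       return record -> parseAvro.apply((GenericRecord) record);
     }
   }
   ```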
   
   I can think of a couple of other ways to handle this:
   - Option (3): Convert the `Row` instances to `GenericRecord` instances using the `AvroUtils` code before loading them into a `SchemaAndRecord` instance (see the sketch after this list). This isn't ideal from a performance perspective, but at least it would make for relatively clean code.
   - Option (4): Move the abstraction to a higher level. Update the code in `BigQueryIO` that accepts a `SchemaAndRecord` so it can process either a `SchemaAndRecord` or a `Row` (also sketched below). In our case the `Row` would point at an element in an Arrow `RecordBatch`, but in the future we could re-use this path to get rid of `SchemaAndRecord` and process `Row` instances that are backed by Avro `GenericRecord`s. This differs from Option (1) in that we'd keep the `SchemaAndRecord` code path around for now.
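   For Option (3), the conversion could look something like this, assuming the `AvroUtils.toAvroSchema`/`AvroUtils.toGenericRecord` helpers; the `toSchemaAndRecord` wrapper itself is hypothetical.
   
   ```java
   import com.google.api.services.bigquery.model.TableSchema;
   import org.apache.avro.generic.GenericRecord;
   import org.apache.beam.sdk.schemas.utils.AvroUtils;
   import org.apache.beam.sdk.values.Row;
   
   class RowConversion {
     // Convert the Arrow-backed Row into a GenericRecord so SchemaAndRecord
     // can keep its plain GenericRecord field. Costs a per-element copy.
     static SchemaAndRecord toSchemaAndRecord(Row row, TableSchema tableSchema) {
       // In real code we'd compute the Avro schema once, not per element.
       org.apache.avro.Schema avroSchema = AvroUtils.toAvroSchema(row.getSchema());
       GenericRecord record = AvroUtils.toGenericRecord(row, avroSchema);
       return new SchemaAndRecord(record, tableSchema);
     }
   }
   ```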
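   And for Option (4), the rough shape I have in mind is widening the consumer rather than the record wrapper. None of these names are existing API, just a sketch:
   
   ```java
   import java.io.Serializable;
   import org.apache.beam.sdk.values.Row;
   
   // Hypothetical: each reader statically knows which overload to call, so
   // there's no runtime type testing and no Object-typed field anywhere.
   interface BigQueryParseFn<T> extends Serializable {
     // Existing Avro-backed path, kept around for now (unlike Option (1)).
     T parse(SchemaAndRecord schemaAndRecord);
   
     // New path: a Row pointing into an Arrow RecordBatch.
     T parse(Row row);
   }
   ```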
   
   As the one implementing the code, I think you're best suited to make a decision between these options. I can help out by breaking down some of my thoughts on them, though. Options (1) and (4) have the most unknowns, so they're riskier, but they get us closer to the long-term goal. Options (2) and (3) are pretty well understood, so they're not that risky to pursue. Option (2) will probably yield better performance, while Option (3) will probably yield cleaner code. Between those two I'd prefer the cleaner code of (3) for now, since something like (1) or (4) will get us the performance benefits in the long term.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

