reuvenlax commented on code in PR #24145:
URL: https://github.com/apache/beam/pull/24145#discussion_r1067613845
##########
sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/SplittingIterable.java:
##########
@@ -57,7 +84,37 @@ public ProtoRows next() {
while (underlyingIterator.hasNext()) {
StorageApiWritePayload payload = underlyingIterator.next();
ByteString byteString = ByteString.copyFrom(payload.getPayload());
-
+ if (autoUpdateSchema) {
+ try {
+ @Nullable TableRow unknownFields = payload.getUnknownFields();
+ if (unknownFields != null) {
+             // The protocol buffer serialization format supports concatenation.
+             // We serialize any newly "known" fields into a proto and
+             // concatenate it to the existing proto.
+ try {
+ byteString =
+ byteString.concat(
Review Comment:
The prior convert stage includes only the fields known to it in the proto it
generates. It cannot include fields it doesn't know about, since those would
have to appear in the proto descriptor (and it can't use the proto's unknown
field set, as that requires field ids, which are not known yet).
Therefore the incoming byteString contains only the fields that were known to
the convert stage; all other fields are placed into the unknownFields JSON
object. What we are doing here is taking advantage of the fact that the write
step has a more up-to-date view of the schema: we walk the unknownFields JSON
and extract whatever fields are now known (which might still be only a subset
of the remaining fields). We then serialize those fields into a second proto
and concatenate the two protos.
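
The wire-format property this relies on can be shown in isolation. Below is a
hypothetical, self-contained sketch (not Beam or protobuf-library code): it
hand-encodes two varint fields per the protobuf wire format, concatenates the
two byte strings, and decodes the result, showing that concatenation of
serialized messages behaves like a field-wise merge. All class and method
names here are illustrative only.

```java
import java.io.ByteArrayOutputStream;
import java.util.LinkedHashMap;
import java.util.Map;

public class ProtoConcatDemo {

  // Encode an unsigned value as a base-128 varint.
  static void writeVarint(ByteArrayOutputStream out, long n) {
    while ((n & ~0x7FL) != 0) {
      out.write((int) ((n & 0x7F) | 0x80));
      n >>>= 7;
    }
    out.write((int) n);
  }

  // Encode a single varint-typed field (wire type 0): tag varint, then value.
  static byte[] encodeField(int fieldNumber, long value) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    writeVarint(out, (long) fieldNumber << 3); // low 3 tag bits = wire type 0
    writeVarint(out, value);
    return out.toByteArray();
  }

  // Read one varint starting at offset i; returns {value, nextOffset}.
  static long[] readVarint(byte[] buf, int i) {
    long result = 0;
    int shift = 0;
    while (true) {
      int b = buf[i++] & 0xFF;
      result |= (long) (b & 0x7F) << shift;
      if ((b & 0x80) == 0) {
        return new long[] {result, i};
      }
      shift += 7;
    }
  }

  // Decode all varint fields in the buffer into fieldNumber -> value.
  static Map<Integer, Long> decode(byte[] buf) {
    Map<Integer, Long> fields = new LinkedHashMap<>();
    int i = 0;
    while (i < buf.length) {
      long[] tag = readVarint(buf, i);
      long[] value = readVarint(buf, (int) tag[1]);
      fields.put((int) (tag[0] >>> 3), value[0]);
      i = (int) value[1];
    }
    return fields;
  }

  public static void main(String[] args) {
    byte[] known = encodeField(1, 42);     // field known at convert time
    byte[] newlyKnown = encodeField(2, 7); // field resolved at write time
    byte[] merged = new byte[known.length + newlyKnown.length];
    System.arraycopy(known, 0, merged, 0, known.length);
    System.arraycopy(newlyKnown, 0, merged, known.length, newlyKnown.length);
    // Decoding the concatenation yields both fields, as if one message
    // had been serialized with both fields set.
    System.out.println(decode(merged));
  }
}
```

Note that for repeated concatenation of the *same* singular field, protobuf
semantics keep the last occurrence, which is why concatenating a proto holding
only the newly-known fields is safe here: the two protos carry disjoint field
sets.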
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]