Re: [PR] Support Map and Arrays of Maps in BQ for StorageWrites API for Beam Rows [beam]

via GitHub Wed, 16 Oct 2024 12:40:18 -0700


prodriguezdefino commented on code in PR #32512:
URL: https://github.com/apache/beam/pull/32512#discussion_r1803693956



##########
sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BeamRowToStorageApiProto.java:
##########
@@ -289,25 +307,44 @@ private static Object toProtoValue(
       case ROW:
         return messageFromBeamRow(fieldDescriptor.getMessageType(), (Row) 
value, null, -1);
       case ARRAY:
-        List<Object> list = (List<Object>) value;
-        @Nullable FieldType arrayElementType = 
beamFieldType.getCollectionElementType();
-        if (arrayElementType == null) {
-          throw new RuntimeException("Unexpected null element type!");
-        }
-        return list.stream()
-            .map(v -> toProtoValue(fieldDescriptor, arrayElementType, v))
-            .collect(Collectors.toList());
       case ITERABLE:
         Iterable<Object> iterable = (Iterable<Object>) value;
         @Nullable FieldType iterableElementType = 
beamFieldType.getCollectionElementType();
         if (iterableElementType == null) {
-          throw new RuntimeException("Unexpected null element type!");
+          throw new RuntimeException("Unexpected null element type: " + 
fieldDescriptor.getName());
         }
-        return StreamSupport.stream(iterable.spliterator(), false)
-            .map(v -> toProtoValue(fieldDescriptor, iterableElementType, v))
-            .collect(Collectors.toList());
+        // We currently only support maps as non-row or non-scalar element 
types
+        // given that BigQuery does not support nested arrays. If the element 
type is of map type
+        // we should flatten it given how is being translated (as a list of 
proto(key, value).

Review Comment:
   as you correctly stated, BQ does not support those nested container types, 
but it does not support maps either. 
   
   currently a Beam user, when translating formats for BQ ingestion (from Avro, 
Thrift, or others which support MAP natively), I need to inspect the schemas or 
IDLs to understand if a MAP or a nested container is there and translate it 
into something that works for BQ. This adds complexity to the pipelines, and 
can be detrimental of the overall performance (because potential packing and 
unpacking needed to translate data in the original format). 
   
   this change aims to aid that translation, for both cases MAP type and 
ARRAY/ITERABLE of MAPs which are both supported in Beam Rows (used for 
simplicity after original format translation) but not in the BQ storage write 
proto translation. for MAP, we are making a structure decision for the 
translation, and for ARRAY/ITERABLE of MAPs as well. 
   
   I agree, we are losing key functionalities from the original structures with 
this translation decision (indexing and key uniqueness as starters), but I 
think through improved documentation we can alert the users about this caveats 
(which does not affect already existing pipelines given that this is a net-new 
feature).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Support Map and Arrays of Maps in BQ for StorageWrites API for Beam Rows [beam]

Reply via email to