prodriguezdefino commented on code in PR #32512:
URL: https://github.com/apache/beam/pull/32512#discussion_r1803693956
##########
sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BeamRowToStorageApiProto.java:
##########
@@ -289,25 +307,44 @@ private static Object toProtoValue(
case ROW:
return messageFromBeamRow(fieldDescriptor.getMessageType(), (Row)
value, null, -1);
case ARRAY:
- List<Object> list = (List<Object>) value;
- @Nullable FieldType arrayElementType =
beamFieldType.getCollectionElementType();
- if (arrayElementType == null) {
- throw new RuntimeException("Unexpected null element type!");
- }
- return list.stream()
- .map(v -> toProtoValue(fieldDescriptor, arrayElementType, v))
- .collect(Collectors.toList());
case ITERABLE:
Iterable<Object> iterable = (Iterable<Object>) value;
@Nullable FieldType iterableElementType =
beamFieldType.getCollectionElementType();
if (iterableElementType == null) {
- throw new RuntimeException("Unexpected null element type!");
+ throw new RuntimeException("Unexpected null element type: " +
fieldDescriptor.getName());
}
- return StreamSupport.stream(iterable.spliterator(), false)
- .map(v -> toProtoValue(fieldDescriptor, iterableElementType, v))
- .collect(Collectors.toList());
+ // We currently only support maps as non-row or non-scalar element
types
+ // given that BigQuery does not support nested arrays. If the element
type is of map type
+ // we should flatten it given how is being translated (as a list of
proto(key, value).
Review Comment:
as you correctly stated, BQ does not support those nested container types,
but it does not support maps either.
currently a Beam user, when translating formats for BQ ingestion (from Avro,
Thrift, or others which support MAP natively), I need to inspect the schemas or
IDLs to understand if a MAP or a nested container is there and translate it
into something that works for BQ. This adds complexity to the pipelines, and
can be detrimental of the overall performance (because potential packing and
unpacking needed to translate data in the original format).
this change aims to aid that translation, for both cases MAP type and
ARRAY/ITERABLE of MAPs which are both supported in Beam Rows (used for
simplicity after original format translation) but not in the BQ storage write
proto translation. for MAP, we are making a structure decision for the
translation, and for ARRAY/ITERABLE of MAPs as well.
I agree, we are losing key functionalities from the original structures with
this translation decision (indexing and key uniqueness as starters), but I
think through improved documentation we can alert the users about this caveats
(which does not affect already existing pipelines given that this is a net-new
feature).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]