rangadi opened a new pull request, #40983:
URL: https://github.com/apache/spark/pull/40983
### What changes were proposed in this pull request?
This adds an option to convert Protobuf 'Any' fields to JSON. At runtime
such 'Any' fields
can contain arbitrary Protobuf message serialized as binary data.
By default when this option is not enabled, such field behaves like normal
Protobuf message
with two fields (`STRUCT<type_url: STRING, value: BINARY>`). The binary
`value` field is not
interpreted. This might not be convenient in practice.
One option is to deserialize it into actual Protobuf message and convert it
to Spark STRUCT.
But this is not feasible since the schema for `from_protobuf()` is needed at
query compile
time and can not change at run time. As a result this is not feasible.
Another option is parse the binary and deserialize the Protobuf message into
JSON string.
This this lot more readable than the binary data. This configuration option
enables
converting Any fields to JSON. The example blow clarifies further.
Consider two Protobuf types defined as follows:
```
message ProtoWithAny {
string event_name = 1;
google.protobuf.Any details = 2;
}
message Person {
string name = 1;
int32 id = 2;
}
```
With this option enabled, schema for `from_protobuf("col", messageName =
"ProtoWithAny")`
would be : `STRUCT<event_name: STRING, details: STRING>`.
At run time, if `details` field contains `Person` Protobuf message, the
returned value looks
like this with JSON string for `details`:
('click',
'{"@type":"type.googleapis.com/...ProtoWithAny","name":"Mario","id":100}')
Requirements:
- The definitions for all the possible Protobuf types that are used in Any
fields should be
available in the Protobuf descriptor file passed to `from_protobuf()`. If
any Protobuf
is not found, it will result in error for that record.
- This feature is supported with Java classes as well. But only the
Protobuf types defined
in the same `proto` file as the primary Java class might be visible.
E.g. if `ProtoWithAny` and `Person` in above example are in different
proto files,
definition for `Person` may not be found.
### Why are the changes needed?
Improves handling of Any fields.
### Does this PR introduce _any_ user-facing change?
No. Default behavior is not changed
### How was this patch tested?
- Unit tests
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]