rangadi opened a new pull request, #41192:
URL: https://github.com/apache/spark/pull/41192
### What changes were proposed in this pull request?
Protobuf functions (`from_protobuf()` & `to_protobuf()`) take the path of a
descriptor file and use it to construct Protobuf descriptors.
The main problem with this approach is that the file is read many times (e.g. at
each executor). This is unnecessary and error-prone. E.g. the file contents may be
updated a couple of days after a streaming query starts, which could lead to
various errors.
**The fix**: Use the byte content (which is a serialized `FileDescriptorSet`
proto). We read the content from the file once and carry the byte buffer.
This also adds a new API where we can pass the byte buffer directly. This is
useful when users fetch the content themselves and pass it to the Protobuf
functions. E.g. they could fetch it from S3, or extract it from Python Protobuf
classes.
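
To illustrate the read-once idea, here is a minimal, Spark-free sketch in plain Python. It is not the code in this PR: `load_descriptor_bytes` and `make_task` are hypothetical helper names, the closure only stands in for work shipped to an executor, and the file contents are placeholder bytes rather than a real `FileDescriptorSet`.

```python
import os
import tempfile


def load_descriptor_bytes(desc_path: str) -> bytes:
    """Read the serialized descriptor content once (hypothetical helper)."""
    with open(desc_path, "rb") as f:
        return f.read()


def make_task(descriptor_bytes: bytes):
    """Return a closure standing in for work shipped to an executor.

    The closure captures the bytes, not the path, so it never re-reads
    the file.
    """
    def task() -> bytes:
        return descriptor_bytes

    return task


# Demonstration with a throwaway file holding placeholder bytes.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.desc")
    with open(path, "wb") as f:
        f.write(b"descriptor-v1")

    # Read once, up front, and carry the buffer from here on.
    captured = load_descriptor_bytes(path)
    task = make_task(captured)

    # Simulate the file being updated after the query has started.
    with open(path, "wb") as f:
        f.write(b"descriptor-v2")

    seen_by_task = task()                           # still the original bytes
    reread_from_path = load_descriptor_bytes(path)  # picks up the update
```

Because the task holds the buffer rather than the path, a later change to the file cannot alter what the task sees. The same shape covers the case where the bytes never came from a local file at all, e.g. when the user fetched them from S3.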
**Note to reviewers**: This includes a lot of updates to test files, mainly
because of the interface change to pass the buffer. I have left a few PR comments
to help with the review.
### Why are the changes needed?
Described above.
### Does this PR introduce _any_ user-facing change?
Yes, this adds two new versions of the `from_protobuf()` and `to_protobuf()`
API that take the serialized descriptor bytes rather than a file path.
### How was this patch tested?
- Unit tests
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]