[GitHub] [spark] rangadi opened a new pull request, #41192: [SPARK-43530][PROTOBUF] Read descriptor file only once

via GitHub Tue, 16 May 2023 16:19:07 -0700


rangadi opened a new pull request, #41192:
URL: https://github.com/apache/spark/pull/41192


   ### What changes were proposed in this pull request?
   
   Protobuf functions (`from_protobuf()` & `to_protobuf()`) take the file path of a 
descriptor file and use it to construct Protobuf descriptors. 
   The main problem with this approach is that the file is read many times (e.g. at 
each executor). This is unnecessary and error-prone. E.g. the file contents may be 
updated a couple of days after a streaming query starts, so later reads would no 
longer match the contents the query started with. That could lead to various errors. 
   
   **The fix**: Use the byte content (which is a serialized `FileDescriptorSet` 
proto). We read the content from the file once and carry the byte buffer. 
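
The difference between re-reading the path and carrying the bytes can be sketched in plain Python. This is an illustration only, not the code in this PR: `PathBasedReader` and `BytesBasedReader` are hypothetical names, and the file holds placeholder bytes rather than a real serialized `FileDescriptorSet`.

```python
import os
import tempfile


class PathBasedReader:
    """Sketch of the old behaviour: re-read the descriptor file on every use."""

    def __init__(self, path):
        self.path = path

    def descriptor_bytes(self):
        with open(self.path, "rb") as f:
            return f.read()


class BytesBasedReader:
    """Sketch of the new behaviour: read the file once and carry the bytes."""

    def __init__(self, path):
        with open(path, "rb") as f:
            self._content = f.read()

    def descriptor_bytes(self):
        return self._content


def demo():
    # Create a placeholder descriptor file (hypothetical contents).
    fd, path = tempfile.mkstemp(suffix=".desc")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"descriptor-v1")
        by_path = PathBasedReader(path)
        by_bytes = BytesBasedReader(path)
        # Simulate the file being updated after the query has started.
        with open(path, "wb") as f:
            f.write(b"descriptor-v2")
        return by_path.descriptor_bytes(), by_bytes.descriptor_bytes()
    finally:
        os.remove(path)
```

In this sketch the path-based reader picks up the updated file, while the bytes-based reader keeps returning what it read at construction time.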
   
   This also adds a new API where the byte buffer can be passed directly. This is 
useful when users fetch the content themselves and pass it to the Protobuf 
functions. E.g. they could fetch it from S3, or extract it from Python Protobuf 
classes. 
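
One way to picture the two entry points: a path is turned into bytes once at the call site, while bytes supplied by the caller are used as they are. The following is a plain-Python sketch under that assumption; `resolve_descriptor_bytes` and `example` are hypothetical helpers, not Spark functions, and the actual signatures are the ones in the PR diff.

```python
import os
import tempfile
from pathlib import Path
from typing import Union


def resolve_descriptor_bytes(source: Union[str, Path, bytes, bytearray]) -> bytes:
    """Return descriptor bytes from either a file path or an in-memory buffer."""
    if isinstance(source, (bytes, bytearray)):
        # Caller already fetched the content (e.g. from object storage).
        return bytes(source)
    # Otherwise treat the argument as a path and read the file exactly once.
    return Path(source).read_bytes()


def example():
    # Placeholder content standing in for a serialized descriptor set.
    content = b"placeholder-descriptor-bytes"
    fd, path = tempfile.mkstemp(suffix=".desc")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(content)
        from_path = resolve_descriptor_bytes(path)
        from_buffer = resolve_descriptor_bytes(content)
        return from_path, from_buffer
    finally:
        os.remove(path)
```

Both routes end with the same byte buffer, which is the only thing the downstream code needs to carry.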
   
   **Note to reviewers**: This includes a lot of updates to test files, mainly 
because of the interface change to pass the buffer. I have left a few PR comments 
to help with the review.
   
   ### Why are the changes needed?
   Described above.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, this adds two new versions of the `from_protobuf()` and `to_protobuf()` 
APIs that take the serialized descriptor bytes rather than a file path. 
   
   ### How was this patch tested?
    - Unit tests
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


