rangadi opened a new pull request, #40983:
URL: https://github.com/apache/spark/pull/40983

   ### What changes were proposed in this pull request?
   
   This adds an option to convert Protobuf `Any` fields to JSON. At run time, such `Any`
   fields can contain an arbitrary Protobuf message serialized as binary data.

   By default, when this option is not enabled, such a field behaves like a normal
   Protobuf message with two fields (`STRUCT<type_url: STRING, value: BINARY>`). The
   binary `value` field is not interpreted. This might not be convenient in practice.

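To make the default representation concrete, here is a minimal Python sketch that hand-encodes the Protobuf wire format for a hypothetical two-field message (a string in field 1, an integer in field 2) and prints the opaque bytes that would end up in `value`. The helper names and the `example.Person` type URL are assumptions for illustration. This is not Spark code, and it only handles non-negative integers.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a Protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def length_delimited(field_number: int, payload: bytes) -> bytes:
    """Encode a string/bytes/sub-message field (wire type 2)."""
    return encode_varint((field_number << 3) | 2) + encode_varint(len(payload)) + payload


def varint_field(field_number: int, value: int) -> bytes:
    """Encode an integer field (wire type 0)."""
    return encode_varint(field_number << 3) + encode_varint(value)


# Hypothetical payload: field 1 = "Mario", field 2 = 100.
payload = length_delimited(1, b"Mario") + varint_field(2, 100)

# What the default STRUCT<type_url: STRING, value: BINARY> would carry for it.
default_details = {
    "type_url": "type.googleapis.com/example.Person",  # hypothetical package name
    "value": payload,
}
print(default_details["value"].hex())  # 0a054d6172696f1064
```

A query that needs the name or id would have to decode these bytes itself, which is the inconvenience the new option is meant to remove.
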
   One option is to deserialize it into the actual Protobuf message and convert it to a
   Spark STRUCT. This is not feasible, since the schema for `from_protobuf()` is needed
   at query compile time and cannot change at run time.

   Another option is to parse the binary and deserialize the Protobuf message into a
   JSON string, which is a lot more readable than the binary data. This configuration
   option enables converting `Any` fields to JSON. The example below clarifies further.

   Consider two Protobuf types defined as follows:

   ```
   message ProtoWithAny {
     string event_name = 1;
     google.protobuf.Any details = 2;
   }

   message Person {
     string name = 1;
     int32 id = 2;
   }
   ```

   With this option enabled, the schema for
   `from_protobuf("col", messageName = "ProtoWithAny")` would be
   `STRUCT<event_name: STRING, details: STRING>`.

   At run time, if the `details` field contains a `Person` Protobuf message, the returned
   value looks like this, with a JSON string for `details`:

       ('click', '{"@type":"type.googleapis.com/...Person","name":"Mario","id":100}')

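Since `details` arrives as a plain JSON string, downstream code can branch on its `@type` entry. The following Python sketch shows one way a consumer might do that, assuming `@type` names the packed message. The `example` package in the type URL and the `split_any_json` helper are hypothetical, standing in for the part elided in the example above.

```python
import json

# Row shaped like the example above; "example" is a hypothetical package name.
row = ("click", '{"@type":"type.googleapis.com/example.Person","name":"Mario","id":100}')


def split_any_json(details: str):
    """Return (message_type, fields) for a JSON-encoded Any value."""
    payload = json.loads(details)
    type_url = payload.pop("@type")
    # The full message name is the part of the type URL after the last "/".
    message_type = type_url.rsplit("/", 1)[-1]
    return message_type, payload


event_name, details = row
message_type, fields = split_any_json(details)
print(event_name, message_type, fields)
```
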
   Requirements:

    - The definitions for all Protobuf types that can appear in `Any` fields should be
      available in the Protobuf descriptor file passed to `from_protobuf()`. If a
      Protobuf type is not found, it results in an error for that record.
    - This feature is supported with Java classes as well, but only the Protobuf types
      defined in the same `proto` file as the primary Java class might be visible.
      E.g., if `ProtoWithAny` and `Person` in the example above are in different proto
      files, the definition for `Person` may not be found.

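The first requirement can be pictured with a toy Python decoder. It is only a sketch under stated assumptions: the `KNOWN_TYPES` dictionary is a hypothetical stand-in for the descriptor file, the `example` package name is invented, and only varint and string fields are handled. Spark's actual lookup and conversion are not shown here.

```python
import json

# Hypothetical stand-in for the descriptor file:
# full message name -> {field number: field name}.
KNOWN_TYPES = {"example.Person": {1: "name", 2: "id"}}


def read_varint(buf: bytes, pos: int):
    """Decode a base-128 varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def any_to_json(type_url: str, value: bytes) -> str:
    """Render an Any payload as JSON, failing if its type is not registered."""
    full_name = type_url.rsplit("/", 1)[-1]
    if full_name not in KNOWN_TYPES:
        raise LookupError(f"no definition found for {full_name}")
    field_names = KNOWN_TYPES[full_name]
    out = {"@type": type_url}
    pos = 0
    while pos < len(value):
        key, pos = read_varint(value, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:  # varint
            decoded, pos = read_varint(value, pos)
        elif wire_type == 2:  # length-delimited, treated as a UTF-8 string here
            length, pos = read_varint(value, pos)
            decoded = value[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} is not handled in this sketch")
        out[field_names[field_number]] = decoded
    return json.dumps(out, separators=(",", ":"))


person_bytes = bytes.fromhex("0a054d6172696f1064")  # name = "Mario", id = 100
print(any_to_json("type.googleapis.com/example.Person", person_bytes))

try:
    # A type missing from the registry fails for this record only.
    any_to_json("type.googleapis.com/example.Address", b"")
except LookupError as err:
    print("record failed:", err)
```
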
   ### Why are the changes needed?
   
   Improves handling of Any fields.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No. The default behavior is not changed.
   
   
   ### How was this patch tested?
   - Unit tests
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

