jklamer opened a new pull request #1602:
URL: https://github.com/apache/avro/pull/1602


   ### Description
   In the Encoding and Decoding recursive flows, the schema definitions are 
being added to a map ad hoc. This means for input values that do not recurse to 
the schema definition needed by a later schema ref, there is an encoding error 
when both the input schema and input Value/bytes are valid. 
   Ex: The following schema fails to encode/decode when a is the first union 
variant (null)
   ```json
   {
               "type":"record",
               "name":"TestStruct",
               "fields": [
                   {
                       "name":"a",
                       "type":[ "null", {
                           "type":"record",
                           "name": "Inner",
                           "fields": [ {
                               "name":"z",
                               "type":"int"
                           }]
                       }]
                   },
                   {
                       "name":"b",
                       "type":"Inner"
                   }
               ]
           }
   ```
   
   In the parsing, encoding, and decoding recursive workflow, the enclosing 
namespace needed to globally uniquely define a named schema type is not being 
passed recursively to help fully qualify a schema's name. This leads to the 
potential of ambiguous or undefined schema errors.
   Ex: the following fails to parse when it should
   ```json
   {
             "name": "record_name",
             "namespace": "space",
             "type": "record",
             "fields": [
               {
                 "name": "outer_field_1",
                 "type": [
                           "null",
                           {
                               "type":"record",
                               "name":"inner_record_name",
                               "fields":[
                                   {
                                       "name":"inner_field_1",
                                       "type":"double"
                                   }
                               ]
                           }
                       ]
               },
               {
                   "name": "outer_field_2",
                   "type" : "space.inner_record_name"
               }
             ]
           }
   ```
   Ex the following succeeds when it should fail
   ```
   {
               "type":"record",
               "name":"outer",
               "namesapce":"outer.space",
               "fields":[
                   {
                       "name": "field1",
                       "type": {
                           "type" : "record",
                           "name" : "middle",
                           "namespace":"middle.space",
                           "fields":[
                               {
                                   "name":"middle_field_1",
                                   "type": {
                                       "type":"record",
                                       "name":"inner",
                                       "fields":[
                                           {
                                               "name":"inner_field_1",
                                               "type":"double"
                                           }
                                       ]
                                   }
                               }
                           ]
                       }
                   },
                   {
                       "name":"field2",
                       "type":"inner"
                   }
               ]
           }
   ```
   
   The proposed solution here extends from the conversation has in the comments 
of https://github.com/apache/avro/pull/1582 by fulling resolving all the names 
to schema reference before all encoding and decoding workflows. In addition, 
this PR also updates the classes, and types that work with names to make them 
easier to work with. It uses these updated classes and types to passes a 
namespace type recursively so it can be use in fully qualifying the name of 
Named types in parsing, encoding and decoding workflows. The fully qualified 
name is ultimately what is used for schema lookup to avoid all ambiguity.
   
   ### Breaking Changes (cosmetic not function)
   - Aliases moved out of the name struct to match with Java SDK (something 
suggested by @[travisbrown](https://github.com/travisbrown) that I echo), and 
for clarity within this flow that uses names as a HashTable index.
   - encode_ref now includes the names lookup table and enclosing namespace 
input arguments. Can be replaces by the encode function which I believe was 
serving the same purpose)
   
   ### Future work
   - Use the fulling indexed schema (now called resolved schema) to help speed 
up encoding and decoding workflows ( avoid indexing schemas every time)
   - Better type/name resolution error messages
   - Avoid panics on encoding flow and set up AvroResult<()> returns with 
proper errors
   
   ### Jira
   https://issues.apache.org/jira/browse/AVRO-3448
   
   ### Tests
   - My PR adds unit tests that are design to test taking the name/namespacing 
spec to the extreme.
   - I including many of the same schema in tests for the 
parsing/encoding/decoding workflows. The test dependency (have to parse a 
schema to encode with it, have to encode with a schema before decoding with it, 
etc) 
   - All previous unit tests unchanged except for trivial constructor changes. 
   
   ### Documentation
   - I have added docs in some places where I believe the flow or use case to 
be non obvious
   - more can be added based on review feedback and what gets questions. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to