[ 
https://issues.apache.org/jira/browse/AVRO-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13697022#comment-13697022
 ] 

Jeremy Kahn commented on AVRO-1343:
-----------------------------------

It causes problems for unioned data in Python, because Python moves to generic 
data and then introspects the data with {{validate}} to determine which union 
member to use to re-encode the data.

Suppose I start with a schema:
{code}{"type": "record", "name": "superset",
  "fields": [ {"name": "foo", "type": "int" },
              {"name": "bar", "type": "string"} ] }
{code}
If I encode these two lines with a schema of _only_ {{superset}} objects:
{code}
  {"foo": 99, "bar": "banana"}
  {"foo": -98, "bar": "peaches"}{code}
the data is entirely recoverable.  But if I rewrite that datafile with a 
schema supporting a union of {{superset}} and {{subset}}:
{code}
[{"type": "record", "name": "superset",
  "fields": [ {"name": "foo", "type": "int" },
              {"name": "bar", "type": "string"} ] },
 {"type": "record", "name": "subset",
  "fields": [ {"name": "foo", "type": "int" } ] }
]{code}
the data will be re-encoded as {{subset}} objects, silently discarding the 
{{bar}} field.
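
A minimal sketch of that failure, as a toy model rather than the actual avro 
code: {{permissive_validate}}, {{select_union_member}} and {{encode}} are 
hypothetical names, and the "last match wins" rule is taken from the issue 
description quoted below.

```python
# Toy model of the behaviour described above; NOT the avro library's code.
SUPERSET = {"type": "record", "name": "superset",
            "fields": [{"name": "foo", "type": "int"},
                       {"name": "bar", "type": "string"}]}
SUBSET = {"type": "record", "name": "subset",
          "fields": [{"name": "foo", "type": "int"}]}

# Only the two primitive types used in this example are modelled.
PRIMITIVES = {"int": int, "string": str}


def permissive_validate(schema, datum):
    """Accept a dict when every declared field is present and well typed.

    Keys that the schema does not declare are ignored, which is the
    permissive behaviour this issue is about.
    """
    if not isinstance(datum, dict):
        return False
    return all(
        f["name"] in datum
        and isinstance(datum[f["name"]], PRIMITIVES[f["type"]])
        for f in schema["fields"]
    )


def select_union_member(union, datum):
    """Return the LAST union member that validates, or None."""
    chosen = None
    for member in union:
        if permissive_validate(member, datum):
            chosen = member
    return chosen


def encode(union, datum):
    """Stand-in for encoding: keep only the fields of the chosen member."""
    member = select_union_member(union, datum)
    return member["name"], {f["name"]: datum[f["name"]]
                            for f in member["fields"]}


name, written = encode([SUPERSET, SUBSET], {"foo": 99, "bar": "banana"})
print(name, written)
```

In this model the record validates against both members, the later 
{{subset}} wins, and {{bar}} never reaches the output.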

This behavior seems fundamentally backwards-breaking _as unpatched_, but here's 
a way we could limit the change to union member selection: I could rewrite the 
patch to pass an optional {{strict}} argument (default {{False}}) to 
{{validate}}, and then use {{strict=True}} when doing union-member selection.  
This would, I believe, still allow extra fields for simple records, but reject 
records with extra fields when determining the correct union member.
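
A rough sketch of what that could look like, again as a toy model and not the 
attached patch: the {{strict}} keyword is the proposal above, while the 
function bodies and the {{select_union_member}} helper are assumptions made 
for illustration.

```python
# Toy model of the proposed strict flag; NOT the attached patch.
SUPERSET = {"type": "record", "name": "superset",
            "fields": [{"name": "foo", "type": "int"},
                       {"name": "bar", "type": "string"}]}
SUBSET = {"type": "record", "name": "subset",
          "fields": [{"name": "foo", "type": "int"}]}

# Only the two primitive types used in this example are modelled.
PRIMITIVES = {"int": int, "string": str}


def validate(schema, datum, strict=False):
    """Validate a dict against a record schema.

    With strict=False (the default) undeclared keys are tolerated, so
    existing callers keep their behaviour. With strict=True any key the
    schema does not declare makes the datum invalid.
    """
    if not isinstance(datum, dict):
        return False
    declared = {f["name"] for f in schema["fields"]}
    if strict and not set(datum) <= declared:
        return False
    return all(
        f["name"] in datum
        and isinstance(datum[f["name"]], PRIMITIVES[f["type"]])
        for f in schema["fields"]
    )


def select_union_member(union, datum):
    """Pick the last member that validates strictly, or None."""
    chosen = None
    for member in union:
        if validate(member, datum, strict=True):
            chosen = member
    return chosen


record = {"foo": 99, "bar": "banana"}
print(validate(SUBSET, record))               # plain record validation
print(validate(SUBSET, record, strict=True))  # union-member selection
print(select_union_member([SUPERSET, SUBSET], record)["name"])
```

Under these assumptions a record with both fields can only match 
{{superset}} during member selection, while a plain {{validate}} call 
against {{subset}} still accepts it.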

Of course, someone might still be expecting to put things into Python unions 
with extra fields and depending on the schema to discard these, but I think 
anyone with that expected behavior would have encountered this bug already.
                
> Python: validate too permissive on records with extra fields
> ------------------------------------------------------------
>
>                 Key: AVRO-1343
>                 URL: https://issues.apache.org/jira/browse/AVRO-1343
>             Project: Avro
>          Issue Type: Bug
>          Components: python
>            Reporter: Jeremy Kahn
>            Assignee: Jeremy Kahn
>             Fix For: 1.7.5
>
>         Attachments: AVRO-1343-tests.patch, AVRO-1343-validate.patch
>
>
> Python's validator silently accepts (generic) records with extra fields and 
> considers them valid.
> For example, {{io.validate}} silently considers that the schema:
> {noformat}{"type": "record",
>  "name": "Test",
>  "fields": [{"name": "f", "type": "long"}]}
> {noformat}
> should accept records like:
> {noformat}{'f': 5, 'extra_field': "abc"}{noformat}
> but this is problematic.
> This is *especially* problematic for encoding unions, because internally the 
> Python serializer uses {{validate}} to find the appropriate schema with which 
> to encode a given object.
> In the current implementation, union schema selection is the *last* schema 
> that {{validate(schema, obj)}} returns {{True}} for.  If {{validate}} isn't 
> picky, this encoding will frequently guess wrong.
> I will attach two patches: one to the tests and one to the {{validate}} 
> function.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
