[
https://issues.apache.org/jira/browse/AVRO-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205620#comment-13205620
]
Douglas Kaminsky commented on AVRO-973:
---------------------------------------
While a good start for primitives specifically, these union hint semantics are
insufficient to solve the larger underlying problem. Union hint semantics need
to be available for all types.
I have the same issue: we have several message types that extend one another
on the Java side, but in Python it's impossible to serialize them correctly.
This generalizes to any two types in a single union whose "data domains"
intersect.
e.g. Java Code contains two classes: com.foo.Foo and com.foo.Bar
Avro schema specifies record type "Message" with field "event" : ["null",
"com.foo.Foo", "com.foo.Bar"]
When serializing the "event" field of a "Message" whose value is a Bar:
Does this validate against a NullSchema? False - index=-1
Does this validate against a "com.foo.Foo"? True - index=1, BREAK
It is then serialized as a Foo, and all Bar-unique fields are lost.
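
The walk above can be reproduced without the Avro library. The following is
only a rough sketch: the dict-based schemas, the "id" and "detail" field names,
and the helpers validate, resolve_first_match and project are all hypothetical
stand-ins, assuming that record validation checks only for the presence of the
schema's fields and ignores extras.

```python
# Hypothetical stand-ins for Avro record schemas: a name plus field names.
FOO = {"name": "com.foo.Foo", "fields": ["id"]}
BAR = {"name": "com.foo.Bar", "fields": ["id", "detail"]}
UNION = [None, FOO, BAR]  # None stands in for the "null" branch


def validate(schema, datum):
    """Loose structural check: every field of the schema is present."""
    if schema is None:
        return datum is None
    return isinstance(datum, dict) and all(f in datum for f in schema["fields"])


def resolve_first_match(union, datum):
    """Return the index of the first branch that validates, or -1."""
    for i, candidate in enumerate(union):
        if validate(candidate, datum):
            return i  # equivalent to the "break" in write_union
    return -1


def project(schema, datum):
    """Keep only the fields the chosen schema knows about."""
    return {f: datum[f] for f in schema["fields"]}


bar_datum = {"id": 7, "detail": "only Bar has this"}
index = resolve_first_match(UNION, bar_datum)
written = project(UNION[index], bar_datum)
print(index, written)
```

Because every Bar also satisfies Foo's field list, the Foo branch wins and the
"detail" field never reaches the output.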
There's no simple solution:
* If you keep the break, this problem occurs
* If you reverse the order of union traversal, you couple the behaviors in an
inappropriate way
* If you remove the break, you introduce an extremely inefficient process (up
to 255 validations) into serialization (BTW, this process is already pretty
inefficient)
The best I came up with was to add union hints in the form of a wrapper class
and extension to the datum writer (attachment to follow). This mimics the Java
behavior of coupling the datum and its schema.
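
As a rough illustration of the idea (not the attached patch itself), a hint
wrapper could look something like the sketch below. The UnionHint class, the
resolve_with_hint function and the dict-based schemas are hypothetical names
made up for this example; a real version would hook into the datum writer's
union handling instead.

```python
# Hypothetical stand-ins for Avro record schemas: a name plus field names.
FOO = {"name": "com.foo.Foo", "fields": ["id"]}
BAR = {"name": "com.foo.Bar", "fields": ["id", "detail"]}
UNION = [None, FOO, BAR]  # None stands in for the "null" branch


class UnionHint:
    """Hypothetical wrapper pairing a datum with its intended union branch."""

    def __init__(self, schema_name, datum):
        self.schema_name = schema_name
        self.datum = datum


def validate(schema, datum):
    """Loose structural check: every field of the schema is present."""
    if schema is None:
        return datum is None
    return isinstance(datum, dict) and all(f in datum for f in schema["fields"])


def resolve_with_hint(union, value):
    """Return (branch index, unwrapped datum) for the value being written."""
    if isinstance(value, UnionHint):
        # The caller named the branch, so no validation sweep is needed.
        for i, candidate in enumerate(union):
            if candidate is not None and candidate["name"] == value.schema_name:
                return i, value.datum
        raise ValueError("no union branch named %r" % value.schema_name)
    # No hint: fall back to first-match traversal with a break.
    for i, candidate in enumerate(union):
        if validate(candidate, value):
            return i, value
    raise ValueError("datum matches no branch of the union")


bar_datum = {"id": 7, "detail": "only Bar has this"}
hinted = resolve_with_hint(UNION, UnionHint("com.foo.Bar", bar_datum))
unhinted = resolve_with_hint(UNION, bar_datum)
print(hinted[0], unhinted[0])
```

With the hint the Bar branch is selected directly; without it the same datum
still falls through to Foo, so unhinted callers keep the existing behavior.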
> Union behavior not consistent
> -----------------------------
>
> Key: AVRO-973
> URL: https://issues.apache.org/jira/browse/AVRO-973
> Project: Avro
> Issue Type: Bug
> Components: python
> Affects Versions: 1.6.1, 1.6.2
> Reporter: Gaurav Nanda
> Labels: patch
> Attachments: AVRO-973-patch-1.patch, AVRO-973-patch-2.patch,
> AVRO-973-patch-3.patch, test_unions.py
>
> Original Estimate: 0.25h
> Remaining Estimate: 0.25h
>
> Python's union does not respect the order in which types are specified.
> For the following schema:
> {"type":"map","values":["int","long","float","double","string","boolean"]},
> an integer value is written as double, but it should respect the order in
> which types have been specified.
> Fixed Code (io.py):
>   def write_union(self, writers_schema, datum, encoder):
>     """
>     A union is encoded by first writing a long value indicating
>     the zero-based position within the union of the schema of its value.
>     The value is then encoded per the indicated schema within the union.
>     """
>     # resolve union
>     index_of_schema = -1
>     for i, candidate_schema in enumerate(writers_schema.schemas):
>       if validate(candidate_schema, datum):
>         index_of_schema = i
>         break  # XXX Add break statement here XXX
>     if index_of_schema < 0: raise AvroTypeException(writers_schema, datum)