[ https://issues.apache.org/jira/browse/AVRO-656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906867#action_12906867 ]
Doug Cutting commented on AVRO-656:
-----------------------------------
> That would be a major change in what the Union is and what you can do with it.
The specification is primarily concerned with (a) schema & protocol syntax; (b)
format of corresponding data. So, as long as an implementation produces and
consumes valid schemas and data, it's a conforming implementation. A
high-fidelity implementation can read and write data without alteration, but an
implementation that cannot write data exactly as read might still be both
useful and correctly implement the Avro specification.
> If this means that an implementation can't use a string directly for an enum,
> but instead uses sentinel objects or a container with a value string and name
> string, isn't that OK?
Sure, that's okay. But currently Ruby, PHP and Python don't distinguish bytes,
enum and fixed at runtime. This is fine except in the case of a union that
contains these types. In that case, an application may end up treating a value
intended to be one type as a different type. That may be a problem for some
applications, and may not be for others. Hopefully someone will fix these
implementations, e.g., to wrap such union values. But I don't think in the
meantime we need to declare that these implementations are non-conforming or
change the spec. Rather we should document the limitation and file bugs to
improve the implementations.
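
To make the limitation concrete, here is a minimal Python sketch. It does not use the Avro Python API; the union schema, the FixedValue and EnumValue wrappers, and both pick_branch functions are hypothetical names made up for illustration. It shows how a content-only check sends a 16-byte value meant as a fixed down the bytes branch, and how a wrapper that carries the type name would avoid that.

```python
from dataclasses import dataclass

# Hypothetical union schema: bytes, a 16-byte fixed, and an enum.
UNION = [
    "bytes",
    {"type": "fixed", "name": "MD5", "size": 16},
    {"type": "enum", "name": "Suit", "symbols": ["HEARTS", "SPADES"]},
]


def matches_unwrapped(schema, value):
    """Content-only check, as when bytes, fixed and enum share runtime types."""
    if isinstance(schema, str):
        return schema == "bytes" and isinstance(value, bytes)
    if schema["type"] == "fixed":
        return isinstance(value, bytes) and len(value) == schema["size"]
    if schema["type"] == "enum":
        return isinstance(value, str) and value in schema["symbols"]
    return False


def pick_branch_unwrapped(union, value):
    """Return the index of the first branch whose content check passes."""
    for index, schema in enumerate(union):
        if matches_unwrapped(schema, value):
            return index
    raise ValueError("no matching branch")


@dataclass(frozen=True)
class FixedValue:
    """Assumed wrapper: fixed data tagged with the name of its schema."""
    name: str
    data: bytes


@dataclass(frozen=True)
class EnumValue:
    """Assumed wrapper: enum symbol tagged with the name of its schema."""
    name: str
    symbol: str


def pick_branch_wrapped(union, value):
    """Choose the branch from the runtime type and the carried type name."""
    for index, schema in enumerate(union):
        if isinstance(schema, str):
            if schema == "bytes" and isinstance(value, bytes):
                return index
        elif schema["type"] == "fixed":
            if isinstance(value, FixedValue) and value.name == schema["name"]:
                return index
        elif schema["type"] == "enum":
            if isinstance(value, EnumValue) and value.name == schema["name"]:
                return index
    raise ValueError("no matching branch")


digest = bytes(16)  # intended as an MD5 fixed, but it is also valid bytes
unwrapped_choice = pick_branch_unwrapped(UNION, digest)
wrapped_choice = pick_branch_wrapped(UNION, FixedValue("MD5", digest))
```

With the unwrapped check the digest lands on the bytes branch, because that branch comes first and its content check passes. With the wrapper, the fixed branch is chosen by name and a bare bytes value still goes to bytes.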
A primary question of this issue is whether to continue to permit multiple
enum types and multiple fixed types in a union, distinguished by name. No
implementation takes advantage of this today, and it might make
implementations simpler to drop this, permitting only a single enum and a
single fixed per union. So far, no one has presented a use case for this
feature.
I'd also like to see Ruby, Python and PHP improve their union handling by
avoiding recursive validation. If they add a name to each record instance, this
is easy, and it better implements the spirit of the specification. Adding
wrappers for enum, fixed and bytes would also be good, but is a bigger change.
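
As a rough sketch of the name-based approach for records (again plain Python, not the Avro library; Record, full_name and pick_record_branch are hypothetical names), each record instance carries the full name of its schema, so the writer selects the union branch by comparing names instead of validating every field recursively. Comparing the full name, namespace included, also keeps two records with the same short name apart.

```python
def full_name(schema):
    """Join namespace and name; a dotted name is already a full name."""
    name = schema["name"]
    namespace = schema.get("namespace")
    if "." in name or not namespace:
        return name
    return f"{namespace}.{name}"


class Record(dict):
    """Field values tagged with the full name of their record schema."""

    def __init__(self, schema_name, **fields):
        super().__init__(fields)
        self.schema_name = schema_name


def pick_record_branch(union, record):
    """Return the index of the record branch whose full name matches."""
    for index, schema in enumerate(union):
        if isinstance(schema, dict) and schema.get("type") == "record":
            if full_name(schema) == record.schema_name:
                return index
    raise ValueError(f"no branch named {record.schema_name!r}")


# Two records with the same short name and identical fields. Content
# validation cannot tell them apart, and neither can a short-name check.
RECORD_UNION = [
    {"type": "record", "name": "Point", "namespace": "geo.flat",
     "fields": [{"name": "x", "type": "int"}]},
    {"type": "record", "name": "Point", "namespace": "geo.grid",
     "fields": [{"name": "x", "type": "int"}]},
]

grid_point = Record("geo.grid.Point", x=3)
grid_choice = pick_record_branch(RECORD_UNION, grid_point)
```

The lookup costs one string comparison per branch regardless of how deeply the record is nested, which is the main attraction over recursive validation.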
> writing unions with multiple records, fixed or enums can choose wrong branch
> -----------------------------------------------------------------------------
>
> Key: AVRO-656
> URL: https://issues.apache.org/jira/browse/AVRO-656
> Project: Avro
> Issue Type: Bug
> Components: java
> Affects Versions: 1.4.0
> Reporter: Doug Cutting
> Assignee: Doug Cutting
> Attachments: AVRO-656.patch
>
>
> According to the specification, a union may contain multiple instances of a
> named type, provided they have different names. There are several bugs in
> the Java implementation of this when writing data:
> - for record, only the short-name of the record is checked, so the branch
> for a record of the same name in a different namespace may be used by mistake
> - for enum and fixed, the name of the type is not checked, so the first
> enum or fixed in the union will always be assumed when writing. In many
> cases this may cause the wrong data to be written, potentially corrupting
> output.
> This is not a regression. This has never been implemented correctly by Java.
> Python and Ruby never check names, but rather perform a full, recursive
> validation of content.