[ 
https://issues.apache.org/jira/browse/AVRO-656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906127#action_12906127
 ] 

Doug Cutting commented on AVRO-656:
-----------------------------------

Arguably we shouldn't worry so much.  If an implementation can't distinguish 
between string and bytes then it should not be expected to preserve that 
distinction.  All that's really required is that it write valid data.

If we accept this, then we can go with my first proposal above: records are the 
only type that can occur multiply in a union.  Implementations will read data 
into the highest fidelity representation they can, but an implementation that 
represents floats as doubles will not always be able to write exactly the data 
it reads when processing a [float,double] union.  Similarly, an implementation 
that represents enum symbols with strings might sometimes write one in place of 
the other.
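
To make the [float,double] case concrete, here is a minimal sketch in Java.  
It does not use Avro's API: the class FloatDoubleUnionSketch, the Branch enum, 
and the methods chooseBranch and roundTrip are hypothetical names made up for 
illustration.  It assumes a writer that holds every floating-point value as a 
double and simply takes the first numeric branch in the union.

```java
import java.util.List;

// Hypothetical sketch, not Avro's API: a writer whose only floating-point
// representation is double, so it cannot tell the two branches apart.
class FloatDoubleUnionSketch {
    enum Branch { FLOAT, DOUBLE }

    // First-match resolution: since the datum carries no hint about its
    // original type, whichever numeric branch is listed first is chosen.
    static Branch chooseBranch(List<Branch> union, double datum) {
        return union.get(0);
    }

    // Simulates writing the datum through the chosen branch and reading it back.
    static double roundTrip(List<Branch> union, double datum) {
        Branch branch = chooseBranch(union, datum);
        // Writing through the float branch narrows the value to 32 bits.
        return branch == Branch.FLOAT ? (double) (float) datum : datum;
    }

    public static void main(String[] args) {
        double value = 0.1; // not exactly representable as a float
        System.out.println(roundTrip(List.of(Branch.FLOAT, Branch.DOUBLE), value) == value);
        System.out.println(roundTrip(List.of(Branch.DOUBLE, Branch.FLOAT), value) == value);
    }
}
```

Under these assumptions the first line printed is false (the value was 
narrowed to a float on the way out) and the second is true.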

Folks could be advised to order their unions to guard against this.  
Higher-precision numeric types should usually occur before lower-precision 
types.  Enum and fixed should usually occur before string and bytes.
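
The effect of ordering on enum versus string can be sketched the same way.  
This is an illustration only, assuming an implementation that represents enum 
symbols as plain strings and picks the first branch that accepts the datum; 
EnumStringUnionSketch, Branch and firstMatch are hypothetical names, not part 
of any Avro implementation.

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch: enum symbols and strings share one representation.
class EnumStringUnionSketch {
    static final class Branch {
        final String type;          // "enum" or "string"
        final Set<String> symbols;  // empty for a string branch

        Branch(String type, Set<String> symbols) {
            this.type = type;
            this.symbols = symbols;
        }

        // A string branch accepts any text; an enum branch only its symbols.
        boolean accepts(String datum) {
            return type.equals("string") || symbols.contains(datum);
        }
    }

    // Returns the type of the first branch that accepts the datum.
    static String firstMatch(List<Branch> union, String datum) {
        for (Branch branch : union) {
            if (branch.accepts(datum)) {
                return branch.type;
            }
        }
        throw new IllegalArgumentException("no branch accepts: " + datum);
    }

    public static void main(String[] args) {
        Branch suit = new Branch("enum", Set.of("HEARTS", "SPADES"));
        Branch text = new Branch("string", Set.of());

        System.out.println(firstMatch(List.of(text, suit), "SPADES")); // string
        System.out.println(firstMatch(List.of(suit, text), "SPADES")); // enum
        System.out.println(firstMatch(List.of(suit, text), "hello"));  // string
    }
}
```

With the string branch first, the enum branch is never reached.  With the enum 
branch first, valid symbols are written as enums and anything else falls 
through to string.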

For performance, it is reasonable to continue to prohibit multiple arrays and 
maps, since otherwise recursive validation would be required.  Similarly, we 
should update all implementations to select union branches by record name, 
rather than by recursive validation.
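
As a rough sketch of name-based branch selection, one could index a union's 
named branches by full name once and then look each datum up directly.  The 
names below (NamedBranchIndexSketch, indexByFullName, branchFor, and the 
example type names org.a.Event and org.b.Event) are hypothetical and not 
taken from any Avro implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: pick a union branch by the datum's full type name.
class NamedBranchIndexSketch {
    // Builds a map from full name (namespace plus name) to branch position.
    static Map<String, Integer> indexByFullName(List<String> fullNames) {
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < fullNames.size(); i++) {
            // Two branches with the same full name would be ambiguous.
            if (index.put(fullNames.get(i), i) != null) {
                throw new IllegalArgumentException(
                    "duplicate name in union: " + fullNames.get(i));
            }
        }
        return index;
    }

    // A single map lookup; the datum's content is never inspected.
    static int branchFor(Map<String, Integer> index, String datumFullName) {
        Integer position = index.get(datumFullName);
        if (position == null) {
            throw new IllegalArgumentException(
                "not in union: " + datumFullName);
        }
        return position;
    }

    public static void main(String[] args) {
        Map<String, Integer> index =
            indexByFullName(List.of("org.a.Event", "org.b.Event"));
        System.out.println(branchFor(index, "org.a.Event")); // 0
        System.out.println(branchFor(index, "org.b.Event")); // 1
    }
}
```

Keying on the full name rather than the short name keeps two records called 
Event in different namespaces apart, and the lookup cost does not depend on 
the size of the datum.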

> writing unions with multiple records, fixed or enums can choose wrong branch 
> -----------------------------------------------------------------------------
>
>                 Key: AVRO-656
>                 URL: https://issues.apache.org/jira/browse/AVRO-656
>             Project: Avro
>          Issue Type: Bug
>          Components: java
>    Affects Versions: 1.4.0
>            Reporter: Doug Cutting
>            Assignee: Doug Cutting
>         Attachments: AVRO-656.patch
>
>
> According to the specification, a union may contain multiple instances of a 
> named type, provided they have different names.  There are several bugs in 
> the Java implementation of this when writing data:
>  - for record, only the short-name of the record is checked, so the branch 
> for a record of the same name in a different namespace may be used by mistake
>  - for enum and fixed, the name of the type is not checked, so the first 
> enum or fixed in the union will always be assumed when writing.  In many 
> cases this may cause the wrong data to be written, potentially corrupting 
> output.
> This is not a regression.  This has never been implemented correctly by Java. 
>  Python and Ruby never check names, but rather perform a full, recursive 
> validation of content.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
