[
https://issues.apache.org/jira/browse/AVRO-816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065557#comment-13065557
]
Scott Carey commented on AVRO-816:
----------------------------------
{quote}
I agree that since "long" and "int" unify to "long" that ["null", "long"] and
["null", "int"] should ideally unify to ["null", "long"].
{quote}
I think this is a different operation than 'unify'. It creates a schema that
can resolve both source schemas, but information is lost. If we say that the
result is ["null", "long"], we are in effect saying that the two schemas
["null", "long"] and ["null", "long", "int"] are the same. We might as well
disallow the schema ["long", "int", "null"] and permit only one numeric type
in a union.
If there is use for a schema ["long", "int", "null"] (and I argue there is),
then the union of ["null", "long"] and ["null", "int"] should be ["long",
"int", "null"].
Making a union of two schemas is distinct from resolving them to a single
schema that can read both.
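To make the two operations concrete, here is a minimal sketch in plain Java
that models a union as a list of branch type names. It does not use the Avro
API: the class and the helpers unionOf and promotedSuperset are hypothetical
names chosen for illustration, and only the int -> long case is handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: a union is modeled as a list of branch type names.
// This is not the Avro API.
final class UnionMergeSketch {

    // Union of two unions: keep every distinct branch, in first-seen order.
    static List<String> unionOf(List<String> a, List<String> b) {
        Set<String> branches = new LinkedHashSet<>(a);
        branches.addAll(b);
        return new ArrayList<>(branches);
    }

    // Reader-oriented superset: like unionOf, but "int" is dropped when
    // "long" is present, because a long reader can resolve int data.
    static List<String> promotedSuperset(List<String> a, List<String> b) {
        List<String> branches = unionOf(a, b);
        if (branches.contains("long")) {
            branches.remove("int");
        }
        return branches;
    }

    public static void main(String[] args) {
        List<String> nullableLong = Arrays.asList("null", "long");
        List<String> nullableInt = Arrays.asList("null", "int");
        // Keeps both numeric branches.
        System.out.println(unionOf(nullableLong, nullableInt));
        // Collapses int into long, losing the distinction.
        System.out.println(promotedSuperset(nullableLong, nullableInt));
    }
}
```

The first call keeps "int" as its own branch; the second returns the same
list it would return for two ["null", "long"] inputs, which is the
information loss described above.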
Type promotion from int -> long does not lose data, but promotion from int ->
float does: the result is an approximation with (much) fewer significant bits,
since a float carries only a 24-bit significand.
For example:
{code}
System.out.println(String.format("%10d", Integer.MAX_VALUE));
System.out.println(String.format("%10.1f", (float)Integer.MAX_VALUE));
System.out.println(String.format("%10d", Integer.MAX_VALUE - 63));
System.out.println(String.format("%10.1f", (float)(Integer.MAX_VALUE - 63)));
{code}
prints:
{noformat}
2147483647
2147483648.0
2147483584
2147483648.0
{noformat}
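The same point can be checked programmatically. The sketch below (class and
method names are hypothetical) tests whether an int survives a round trip
through the wider type. The comparison is done in long rather than by casting
back to int, because in Java a float-to-int cast saturates at
Integer.MAX_VALUE and would hide the loss for the largest values.

```java
// Sketch: does an int keep its exact value after promotion to a wider type?
final class PromotionRoundTrip {

    // int -> long is exact for every int.
    static boolean survivesLong(int value) {
        return (int) (long) value == value;
    }

    // int -> float is exact only when the value is representable in a float.
    // Compare as long: (int) (float) Integer.MAX_VALUE would saturate back
    // to Integer.MAX_VALUE and mask the rounding.
    static boolean survivesFloat(int value) {
        return (long) (float) value == (long) value;
    }

    public static void main(String[] args) {
        int[] samples = {
            0, 1 << 24, (1 << 24) + 1, Integer.MAX_VALUE - 63, Integer.MAX_VALUE
        };
        for (int v : samples) {
            System.out.printf("%11d  long:%b  float:%b%n",
                    v, survivesLong(v), survivesFloat(v));
        }
    }
}
```

Every sample survives the trip through long; of these samples, only 0 and
2^24 survive the trip through float.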
So on one hand we want to ask for a schema that "can read both of these", and
on the other we want a schema that "can read both at maximum fidelity".
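One way to keep the two questions apart is to have a promotion check report
fidelity rather than a bare yes/no. The sketch below is hypothetical and is
not part of Avro or of the attached patch. It covers only the numeric
promotions listed in the Avro specification's schema resolution rules
(string/bytes are omitted), and the EXACT/LOSSY split is my assumption, based
on the widths of the corresponding Java primitives.

```java
// Hypothetical sketch: classify a writer -> reader primitive promotion.
final class PromotionFidelity {

    enum Fidelity { NOT_PROMOTABLE, LOSSY, EXACT }

    static Fidelity promote(String writer, String reader) {
        if (writer.equals(reader)) {
            return Fidelity.EXACT; // identical type names: nothing to promote
        }
        switch (writer + "->" + reader) {
            case "int->long":     // 32 bits fit in 64 bits
            case "int->double":   // 32 bits fit in a 53-bit significand
            case "float->double": // every float is exactly a double
                return Fidelity.EXACT;
            case "int->float":    // 24-bit significand
            case "long->float":   // 24-bit significand
            case "long->double":  // 53-bit significand
                return Fidelity.LOSSY;
            default:
                return Fidelity.NOT_PROMOTABLE;
        }
    }

    // "Can read both of these."
    static boolean canRead(String writer, String reader) {
        return promote(writer, reader) != Fidelity.NOT_PROMOTABLE;
    }

    // "Can read both at maximum fidelity."
    static boolean canReadExactly(String writer, String reader) {
        return promote(writer, reader) == Fidelity.EXACT;
    }

    public static void main(String[] args) {
        System.out.println(canRead("int", "float"));        // true
        System.out.println(canReadExactly("int", "float")); // false
        System.out.println(canReadExactly("int", "long"));  // true
    }
}
```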
> Schema Comparison Utils
> -----------------------
>
> Key: AVRO-816
> URL: https://issues.apache.org/jira/browse/AVRO-816
> Project: Avro
> Issue Type: New Feature
> Components: java
> Reporter: Joe Crobak
> Assignee: Joe Crobak
> Priority: Minor
> Attachments: AVRO-816.patch, AVRO-816.patch, AVRO-816.patch,
> AVRO-816.patch
>
>
> From my post on the mailing list, and Doug's response:
> {quote}
> On 05/05/2011 10:29 AM, Joe Crobak wrote:
> > We've recently come across a situation where we have two data files with
> > different schemas that we'd like to process together using
> > GenericDatumReader. One schema is promotable to the other, but not vice
> > versa. We'd like to programmatically determine which of the schemas to
> > use. I did a brief look through javadoc and tests, and I couldn't find
> > any examples of checking if one schema is promotable to the other. Has
> > anyone else come across this?
> >
> > For some context, we're considering patching AvroStorage [1] to remove
> > the assumption that all files have the same schema. In our case, our
> > schema has evolved in that a field that was an int was promoted to a long.
> A boolean method that tells you if one schema is promotable to another
> would work in this case, but would not help in cases where, e.g.,
> different fields had changed in different versions. For example, in
> branched development, two branches might each add a distinct symbol to
> an enum. So I think you might be better off with a method that, given
> two schemas, returns their superset, a schema that can read data written
> by either.
> Such a method does not yet exist in Avro, but should not be difficult to
> add. Please file an issue in Jira if this sounds of interest.
> Doug
> {quote}
> I think it would be useful to have both of the methods that Doug mentioned in
> some sort of schema utils class.