Re: Resolving Reader Enumerations -- bug?

Doug Cutting Thu, 25 Oct 2018 10:19:20 -0700

When considering the breakiness of a change, we should consider other
implementations, not just Java.  So, for the first-vs-best match, what does
C, C++, etc. do?  I'd rather conform to the spec, strictly enforcing
first-match, but if most other implementations already do exactly what Java
does, then compatibility would argue that we change the spec.


According to git blame, the structure test you mention was added 4 years
ago in AVRO-1590.

https://issues.apache.org/jira/projects/AVRO/issues/AVRO-1590

At a glance, this looks like Postel's law.  The Java implementation accepts
things beyond what's required by the spec?

Doug





On Thu, Oct 25, 2018 at 10:01 AM Raymie Stata <[email protected]>
wrote:

> As with enums, I found what looks like a discrepancy between the spec
> and the implementation of resolution, this time for the case that the
> reader's schema is a union and the writer's is not, so we look for the
> "first matching field."
>
> The spec for this case (I call it the "reader-union case") is as
> follows: "The first schema in the reader's union that matches the
> writer's schema is recursively resolved against it.  If none match, an
> error is signalled."
>
> Note this says the FIRST schema that matches, not the BEST schema that
> matches.  Notice also that it doesn't say what should happen if the
> match is "imperfect," i.e., if it could "signal an error" on some (or
> all) of its data.
>
> Here is the code in question: https://bit.ly/2ONfl9B
>
> *** Primitive types: on line 460 the code first looks for an _exact_
> match of the schema.  Then, on line 492, it looks for a "promote-able
> match".  Thus, consider these two schemas:
>
>    Reader: ["long", "int"]
>    Writer: "int"
>
> The spec says that the "long" branch of the reader should be returned
> (_FIRST_ match).  But the implementation will pick the "int" branch
> (presumably the "BEST" match).  Which is right, the spec or the impl?
>
> I think we should change the implementation to return the FIRST match:
> it's perfectly functional, the code is simpler, and the spec is
> simpler.  On the other hand, while it's hard to imagine a program out
> there _depending_ on this behavior, strictly speaking this would not
> be a backward-compatible change.  If we want "bug-for-bug"
> compatibility, then the spec could be changed to say something like
> "for primitive types, an exact match is preferred over a promote-able
> one, but if an exact match isn't available, then the FIRST
> promote-able one will be used."
>
> (Note that the code doesn't pick the "best" promotable branch, e.g.,
> in the case of reading an underlying "int", it does not prefer a
> "long" over a "float", but rather takes whichever is first.  This is
> why I prefer changing the implementation, because the implementation
> is neither a "best" nor "first" match, but a mix of both.)
>
>
> *** Record types: on lines 469-471, the the reader's and writer's
> names "match exactly" (i.e.,
> writer.getFullName().equals(reader.getFullName()), then the record is
> considered a match, whether or not that match is "imperfect" (e.g.,
> because the writer is missing a field for which the reader has no
> default value).
>
> On lines 473 to 490, if the reader's and writer's names match "kind
> of" (i.e., if writer.getName().equals(reader.getName()), then it this
> match is "considered", but under the following circumstances:
>
>    STRUCTURE) If matching the fields of the two record schemas
> generates certain kind of errors (e.g., if the writer is missing a
> field without a default value in the reader, or if the reader tries to
> "promote" a field that's a string in the writer to an int in the
> reader), then this "kind of" match is rejected.  Note that no such
> condition is put in place when the full-names match: if the names
> match, then that is the branch that's taken no matter what deeper
> errors might occur when matching the underlying record schema.
>
>    PICK_LAST) If there is more than one "kind of" name match that
> passes the STRUCTURE test, then the _LAST_ of these from the union is
> picked, not the first.
>
> In this case, I think we have no choice but to change the impl to
> conform to the spec (which is to say), get rid of what the comments in
> the code calls a "structure match."
>
> I say this because doing anything based on the unqualified names of
> schemas matching is unprecedented in the Avro semantics.  (If we do
> this for records, why not for enum and fixed?)
>
> And also, the "STRUCTURE" test that is being done is unprincipled.
> For example, when matching two schemas, as mentioned above, this test
> will fail if the the reader's schema for a field tries to "promote" a
> string-valued field to an "int."  However, if the two schemas contain
> a field (with the same name) that's also a record schema, and that
> nested record-schema tries to "promote" a string-valued field to an
> "int", then no error is flagged and this schema will pass the
> STRUCTURE test.
>
> While changing the impl in this case is also backward compatible
> (fixing any bug in a library is "incompatible" to folks who depend on
> that buggy behavior), anyone relying on this inconsistency in the
> STRUCTURE test has a very fragile system on their hands.
>

Re: Resolving Reader Enumerations -- bug?

Reply via email to