I am not an Avro expert, but here are some thoughts:

In general, I think you should submit a PR or PRs to suggest improvements, 
unless you are unsure what change should be made. That seems more productive 
than a list of things that you think are "stupid".

Another possibility would be putting together some of these suggestions into 
the contributor guidance 
https://cwiki.apache.org/confluence/display/AVRO/How+To+Contribute or maybe 
some kind of spec code review checklist.

You might be interested in using AVDL as a more natural way of defining types
than AVSC. Also, you can combine multiple AVSC files in the same parsing
context in order to normalize your type declarations. If you haven't already
noticed, you can refer to an existing record type by its fully qualified name.
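
For example (the schema and field names here are just for illustration), once
a named type has been defined, later fields can refer to it by its full name
instead of repeating the definition:

    { "type": "record", "name": "com.example.Outer", "fields": [
        { "name": "first", "type": { "type": "record", "name": "com.example.Inner",
            "fields": [ { "name": "x", "type": "int" } ] } },
        { "name": "second", "type": "com.example.Inner" }
    ] }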

I agree with your critique of "derived type". It should probably say "complex 
type" instead.

"Type name" does not seem particularly confusing to me. It goes in the 
"typeName" part of "type": "typeName", and it's either one of the primitive 
type names, "record", "enum", "array", "map", or "fixed".
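
A quick illustration of both cases:

    { "type": "string" }                  <- a primitive type name
    { "type": "array", "items": "long" }  <- a complex type name ("array")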

Regarding unsigned 64-bit integers, that's true, but languages like Java and
Python don't have that type, so you'd have to figure out what those languages
should do.

Regarding signaling NaN, the Java spec states, "The Java Virtual Machine has no 
signaling NaN value." But it makes sense that this distinction is important for 
other languages.
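
To illustrate the normalization (a quick Java sketch; the NaN payload here is
arbitrary):

    // floatToIntBits collapses every NaN to the single quiet-NaN pattern
    // 0x7fc00000; floatToRawIntBits preserves the original bits.
    float someNaN = Float.intBitsToFloat(0x7fc12345); // a NaN with a payload
    System.out.println(Integer.toHexString(Float.floatToIntBits(someNaN)));    // 7fc00000
    System.out.println(Integer.toHexString(Float.floatToRawIntBits(someNaN))); // 7fc12345 on common platforms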

> "Default values for union fields correspond to the first schema in the
> union". This phrase is where the JSON encoding for Avro data and the JSON
> encoding for a default value diverge. And, of course, the presence of this
> difference is a design bug.

I don't follow you; maybe clarify, in case other people have the same problem?

> "The null namespace may not be used in a dot-separated sequence of names".
> You previously defined the null namespace as an empty string, not as the
> *whole* namespace. I.e. the null namespace is the lack of a namespace (i.e.
> the lack of the whole dot-separated sequence), so it makes no sense to speak
> of using the null namespace in a dot-separated sequence. You probably mean
> that one should not use an empty string in a dot-separated sequence.

I'm not sure you have that exactly right. The null namespace isn't defined as
an empty string; the empty string is used to indicate the null namespace. The
null namespace is also not necessarily just the lack of a namespace. If you
omit "namespace" and omit the namespace in "name", then the type uses the
namespace of the enclosing definition, and only uses the null namespace if
there is no enclosing namespace. However, I agree that it could be reworded to
say that each <name> component of a namespace must be a non-empty string.
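
For example (names are just illustrative), in this schema "Inner" has no
"namespace" and no dot in its "name", yet its full name is com.example.Inner
rather than a name in the null namespace, because the enclosing namespace
applies:

    { "type": "record", "name": "Outer", "namespace": "com.example", "fields": [
        { "name": "inner", "type": { "type": "record", "name": "Inner",
            "fields": [ { "name": "x", "type": "int" } ] } }
    ] }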

>     * [Bug] [Line 374]. "Deserializing data into a newer schema is
>     accomplished by specifying an additional schema, the results of which
>     are described in Schema Resolution". The term "additional schema" is
>     vague here. I would say: "Deserializing data into a newer schema is
>     accomplished by using the algorithm described in Schema Resolution"

I wrote that in https://github.com/apache/avro/pull/91/files in an effort to 
improve the spec after suffering frustration similar to what you may be 
experiencing :) It did go through some code review in which my questions were
not fully answered, and that resulted in the wording you see now, phrased so
as not to mislead or make inaccurate claims. That
paragraph begins by explaining the importance and use of the "writer's schema". 
The "additional schema" is referring to what I usually call the "reader's 
schema". I believe that a better rewording than what you suggest would be 
something like, "Deserializing data into a newer schema is accomplished by 
providing that newer schema as the <i>reader's</i> schema, which is utilized as 
described in Schema Resolution."
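
To make the roles concrete, here is a minimal sketch using the Java
implementation (the schemas and the encodedBytes variable are made up; the
bytes are assumed to have been written with writerSchema):

    // Classes come from org.apache.avro, org.apache.avro.generic and org.apache.avro.io.
    Schema writerSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}");
    Schema readerSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"long\"},"
        + "{\"name\":\"email\",\"type\":\"string\",\"default\":\"\"}]}");

    // Writer's schema first, reader's (newer) schema second; schema resolution
    // is applied as the data is read.
    DatumReader<GenericRecord> reader = new GenericDatumReader<>(writerSchema, readerSchema);
    Decoder decoder = DecoderFactory.get().binaryDecoder(encodedBytes, null);
    GenericRecord record = reader.read(null, decoder);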

> "Therefore, it is possible, though not advisable, to read Avro data with a
> schema that does not..."
> The whole paragraph is very vague. On first reading I thought it was about
> schema resolution. After several attempts to understand it, I finally
> realized that the paragraph is about attempts to read without the original
> writer's schema available at all. I propose removing the whole paragraph or
> rewriting it completely.

I mostly agree. That paragraph was re-worded in reaction to this comment 
https://github.com/apache/avro/pull/91/commits/1024216a68cdb7ec54ad3d126953755314966588#r77443864
 and this comment 
https://github.com/apache/avro/pull/91/commits/69895cdf468eb75890a3f9dd9a0d89140905417f#r78588984
It doesn't make any sense to me that anyone would try to read Avro by
supplying, for deserialization, a writer's schema that has the same type
structure as the original writer's schema but not the same canonical form. It
really doesn't make any
sense to me that anyone would want to use anything but the original writer's 
schema. Also, the section about "Parsing Canonical Form for Schemas" is very 
specific that it is the definition of what it means for the reader to have "the 
same" schema as a writer. So I don't know why we have to confuse the issue by 
suggesting strange ways of using Avro, or that there are other ways for schemas 
to be "the same". But, I was trying to satisfy the code review comments because 
I thought the documentation really needed clarifications about the role of the 
reader & writer schema, and I didn't want it to be rejected.

I believe the paragraph does provide useful information about the serialization 
& deserialization approach, though, which helps clarify the physical format of 
the data in relation to the schema, and helps users understand the implications 
of schema changes.

> int and long are always serialized the same way, so an int could be 
> deserialized as a long

The spec mentions, "int and long values are written using variable-length 
zig-zag coding." In other words, an int, once serialized, is indistinguishable 
from a long. Therefore, even if the writer's schema had it as an "int", if
your reader's schema has it as a "long" it will still work: you will get back
the
correct value. Since they are both serialized the same way, an int can be 
deserialized as a long. If int and long each had their own serialization 
approach, that wouldn't be the case.
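
Here is a small sketch of that encoding (plain Java, not the Avro library
itself, just the variable-length zig-zag scheme the spec describes), showing
why the bytes for an int and a long with the same value are identical:

    static byte[] encodeLong(long n) {
        long z = (n << 1) ^ (n >> 63);  // zig-zag: small magnitudes -> small codes
        java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream();
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));  // low 7 bits plus continuation bit
            z >>>= 7;
        }
        out.write((int) z);
        return out.toByteArray();
    }

    static byte[] encodeInt(int n) {
        return encodeLong(n);  // an int is widened and encoded exactly like a long
    }

    // encodeInt(64) and encodeLong(64L) both produce the bytes 0x80 0x01,
    // so a reader cannot tell (and does not need to know) which one was written.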

> "null is written as zero bytes". The phrase is vague.

Doesn't seem very vague to me. Zero bytes. How would you reword it?

That's all I have time for,

Shannon

On 2/12/22, 5:22 PM, "Askar Safin" <safinas...@mail.ru.INVALID> wrote:

    Hi. I'm writing my own Avro implementation in Rust for personal use. During
    this work I found a lot of issues in the Avro spec (the list follows).

    I am sending this mail not only to the Avro user and dev mailing lists, but
    also to the Apache community list, the Kafka list and 3 semi-randomly
    chosen Materialize employees, because I want to draw attention to these
    problems. I hope this wider community will help Avro fix these issues and
    possibly provide the necessary resources.

    As far as I understand, Avro is used in Kafka. And Kafka, according to
    their site, is used in "80% of all Fortune 100 companies". So Avro is a
    critical piece of humanity's infrastructure. It should be absolutely
    perfect (and so I list even very small bugs). But it is not perfect.

    Some of the items in this list are (small and big) bugs, some are typos,
    and some are my objections to the design. Some are fixable while keeping
    compatibility, some are not. I don't want to spend my time reporting them
    as separate bugs, but you can try to convince me to do so.

    I created this list simply by reading the spec from end to end (I skipped
    the sections on RPC and logical types). And I didn't even look at the
    implementations!

    I write this in the hope of helping Avro.

    I think a big audit of the spec and its implementations should be done.

    All line numbers apply to spec.xml from tag release-1.11.0 (i.e.
    https://github.com/apache/avro/blob/release-1.11.0/doc/src/content/xdocs/spec.xml
    ). As far as I understand, this tag corresponds to the currently published
    version at https://avro.apache.org/docs/current/spec.html .

    So, here we go.

    * [Opinion] [No line]. In Avro one has to define named records inside each
    other, like so:

    { "type": "record", "name": "a", "fields": [{"name":"b","type":{"type":"record","name":"c",...}}] }

    This is very unnatural. In popular programming languages one usually
    defines named records next to each other, not one inside the other. Such a
    representation is not handy to deal with programmatically. In my
    implementation I have to convert this representation to the usual form
    "root type + list of named types" right after reading the JSON and convert
    back just before writing.

    * [Opinion] [No line]. In this list you will see a lot of questions about
    the Avro schema format (encoded as JSON). A good JSON Schema for it
    ( https://json-schema.org/ ) would resolve many of them.

    * [Seems to be a bug] [Line 49]. "derived type name" is a vague term. In
    fact, throughout the whole spec the phrase "type name" is used very
    vaguely. Sometimes it means strings like "record" and sometimes it means
    the names of named types. I propose defining, at the very beginning of the
    spec, terms for primitive types, terms for strings like "record", and terms
    for the names of defined types. Here is one possible way to do this: call
    strings like "record" and "fixed" "type kinds" and never call them type
    names, thus reserving the term "type name" for named types only (and
    possibly for primitive types).

    This issue has already caused problems: look, for example, at these
    problems with {"type":"record","name":"record",...}:
    https://lists.apache.org/thread/0wmgyx6z69gy07lvj9ndko75752b8cn2 .

    * [Opinion] [Line 58]. There is no primitive type for unsigned 64-bit
    integers. Such a type is present in languages such as C and Rust.

    * [Very theoretical bug, possibly even security-related] [Line 435]. "The
    float is converted into a 32-bit integer using a method equivalent to
    Java's floatToIntBits and then encoded in little-endian format". If we
    follow the provided link, we will see that this Java function does NaN
    normalization. I think NaN normalization is a good thing. But I think it is
    quite possible that implementers of the spec overlooked this NaN
    normalization requirement. So I propose: write explicitly, directly in the
    Avro spec, that NaNs are normalized. Audit all Avro implementations to see
    whether they actually implement this requirement. Create tests which
    actually test this requirement.

    Also, I don't know whether the bit pattern given in that Java doc
    (0x7fc00000) is a quiet NaN or a signaling one. If it is signaling, this is
    very bad.

    As far as I understand, if you configure your FPU in a particular way, then
    merely copying a signaling NaN from one place to another will abort your
    program. So, if your FPU is configured a certain way, then feeding
    particular binary Avro data to a program can crash it. I.e. this is a
    security problem. So a reader should be careful to check whether the input
    is a signaling NaN *before* storing it in floating-point registers.

    I checked whether manipulating a signaling NaN can actually crash a program
    with default settings on Windows and Linux, and it turned out that the
    program does not crash. Still, I think signaling NaNs should be handled
    carefully.

    Write in the spec that writers should normalize NaNs, that readers should
    reject non-normalized NaNs, and that readers should be careful not to store
    an incoming floating-point number in a floating-point variable before
    sanitizing it. Write that this is a security issue.

    * [Opinion] [Line 68]. "unicode character sequence". As far as I
    understand, a Unicode character sequence means a sequence of Unicode scalar
    values. Note that a scalar value is not the same thing as a code point.
    Unfortunately, some people don't know this, so please write explicitly:
    "this is a sequence of scalar values, not code points", to make sure
    implementations will be correct.

    * [Bug] [Line 71]. "Primitive types have no specified attributes". This is
    a lie. At line 1527 you specify a logical type based on the primitive type
    int. Thus you specify a particular meaning of the attribute "logicalType"
    for the primitive type "int". Be careful with your wording. The spec should
    be rock-solid.

    * [Opinion] [Line 96]. "aliases: a JSON array of strings, providing
    alternate names for this record (optional)". Is an empty array allowed? :)
    Are duplicate aliases allowed? :) Yes, you may say this is nitpicking, but
    I don't think so. Avro has an important place in our infrastructure, so
    everything is important. Think carefully about whether an empty list (and
    duplicates) is allowed everywhere in the spec where you see some kind of
    list. I think empty arrays (and duplicates) should be disallowed in this
    particular case, because the more things we allow, the bigger the attack
    surface.

    * [Bug] [Line 98]. "fields: a JSON array, listing fields". How many fields
    are allowed? Already reported by me at
    https://issues.apache.org/jira/browse/AVRO-3279

    * [Bug] [Line 235]. "Unions". How many variants are allowed in a union?
    Already reported by me at https://issues.apache.org/jira/browse/AVRO-3280

    * [Bug] [Line 101]. "name: a JSON string providing the name of the field
    (required), and". The word "and" is usually placed immediately before the
    last item in a sequence. The text here looks as if the item "doc" used to
    be last, but then the text was edited carelessly. This is a very stupid
    typographic issue which shows that the authors are not careful about spec
    quality. Also, the spec is not consistent about punctuation after items
    (this applies to the whole spec). Sometimes I see nothing at the end of an
    item, sometimes "." and sometimes ";".

    * [Opinion] [Line 106]. "default: A default value for..." What follows is
    essentially a description of the JSON representation of an Avro datum
    (except for unions). So, you managed to put a very important part of your
    spec into a single paragraph inside a second-level bullet point?!

    * [Opinion] [Line 112]. "Default values for union fields correspond to the
    first schema in the union". This phrase is where the JSON encoding for Avro
    data and the JSON encoding for a default value diverge. And, of course, the
    presence of this difference is a design bug.
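
    For example (the field name is made up): with a field declared as
    {"name": "opt", "type": ["string", "null"], "default": "hello"}, the
    default is written as plain JSON ("hello") and must match the first branch,
    while the JSON encoding of the same datum would be wrapped as
    {"string": "hello"}.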

    * [Opinion] [Line 113]. "Default values for bytes and fixed fields are JSON
    strings, where Unicode code points 0-255 are mapped to unsigned 8-bit byte
    values 0-255". Wat? This is a very unnatural encoding. You misuse JSON
    strings: they are for text, not for binary data. You should use an array of
    numbers instead, i.e. encode the bytes 0x0f 0x02 as [15, 2]. Moreover, how
    will you encode null bytes? "\u0000", right? C programs have difficulties
    with such strings.
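
    For instance (the field name is made up), the two bytes 0x0f 0x02 as a
    default have to be written as a string of code points 15 and 2:

        {"name": "payload", "type": "bytes", "default": "\u000f\u0002"}

    whereas [15, 2] would be both easier to read and unambiguous.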

    * [Bug] [Line 123]. Okay, so longs are encoded as JSON integers. But what
    if a given long is not a JSON-safe integer? As we know from
    https://datatracker.ietf.org/doc/html/rfc7159 , integers outside the range
    [-(2**53)+1, (2**53)-1] are not JSON-safe.
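
    For example (a quick Java check, nothing Avro-specific): 2**53 + 1 has no
    exact IEEE-754 double representation, so a JSON parser that goes through
    doubles silently loses it:

        long big = 9007199254740993L;            // 2^53 + 1
        System.out.println((long) (double) big); // prints 9007199254740992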

    * [Bug] [Line 123]. Infinities and NaNs cannot be represented in JSON, even
    though they seem to be allowed by the Avro spec. So, the JSON representation
    of Avro data is incomplete.

    * [Bug] [Line 128]. "enum". What is an enum? :) This term has not yet been
    defined by the spec. For unknown reasons you decided to insert essentially
    the whole description of the JSON representation of Avro data into one
    small paragraph, even before the type system is fully described. Please use
    terms only after their definition.

    * [Stupid bug] [Line 168]. "namespace". The namespace attribute is not
    marked as optional or required. Do you ever read your own spec?

    * [Even more stupid bug] [Line 168]. "<em>namespace</em>". The word
    "namespace" is marked up using <em>, not <code>, and we can see this in the
    rendered version. This is a very stupid typographic bug which is
    immediately obvious to anybody reading the document, even to non-technical
    people.

    * [Bug] [Line 200]. "a single attribute". As we can see in the provided
    example, "default" is allowed, too. What is the meaning of this "default"
    attribute? And how does its meaning differ from the meaning of the
    "default" key in a field description? (Same for maps.)

    * [Bug] [Line 238]. "declares a schema which may be either a null or
    string". A lie. The schema is ["null", "string"]. The *value* may be a null
    or a string. Please check that you don't confuse types (schemas) with
    values throughout your whole spec.

    * [Very opinionated opinion :)] [Line 235]. I don't like your unions at
    all. I come from languages like Haskell and Rust, where true sum types are
    supported. They are similar to your unions, but their alternatives are
    named. Alternatives are identified by their names, so there is no
    restriction on duplicate types, and hence no need for the very unnatural
    restriction "Unions may not contain more than one schema with the same
    type, except for the named types record, fixed and enum".

    * [Absolutely stupid bug] [Line 261]. "aliases: a JSON array of strings,
    providing alternate names for this enum". You mean "fixed", right? So, you
    copy-pasted the section on enums? Have you ever read your spec from end to
    end even once?

    * [Bug] [Line 265]. "size: an integer, specifying the number of bytes per
    value". Is zero allowed?

    * [Bug] [Line 292]. "The null namespace may not be used in a dot-separated
    sequence of names". You previously defined the null namespace as an empty
    string, not as the *whole* namespace. I.e. the null namespace is the lack
    of a namespace (i.e. the lack of the whole dot-separated sequence), so it
    makes no sense to speak of using the null namespace in a dot-separated
    sequence. You probably mean that one should not use an empty string in a
    dot-separated sequence.

    * [Bug] [Line 374]. "Deserializing data into a newer schema is accomplished
    by specifying an additional schema, the results of which are described in
    Schema Resolution". The term "additional schema" is vague here. I would
    say: "Deserializing data into a newer schema is accomplished by using the
    algorithm described in Schema Resolution"

    * [Bug] [Line 380]. "Therefore, it is possible, though not advisable, to
    read Avro data with a schema that does not..." The whole paragraph is very
    vague. On first reading I thought it was about schema resolution. After
    several attempts to understand it, I finally realized that the paragraph is
    about attempts to read without the original writer's schema available at
    all. I propose removing the whole paragraph or rewriting it completely.

    * [Bug] [Line 385]. "For example, int and long are always serialized the
    same way". What does this mean? You probably mean that the *same* int and
    long (i.e. an int and a long which are numerically identical) are
    serialized the same way.

    * [Bug] [Line 413]. "null is written as zero bytes". The phrase is vague.
    Do you mean no bytes at all? Or null bytes, i.e. some undefined number of
    null bytes? (Of course, I understand that you mean the first variant, but I
    still don't like the phrase.)

    * [Bug] [Line 417]. "int and long values are written..." Is a canonical
    (i.e. smallest) encoding of numbers required? Already reported by me at
    https://issues.apache.org/jira/browse/AVRO-3307

    I still think canonical representations should be required. The more forms
    of encoding you allow, the bigger the attack surface (see the small decoder
    sketch after this item).

    Also, it would be a desirable property for binary representations to be
    equal when the data is equal. It would be good if you guaranteed this
    property at least for some subset of schemas (and of course, you should
    state explicitly for which schemas the property is guaranteed).
    Non-canonical representations break this property.
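
    Here is the decoder sketch mentioned above (plain Java, not taken from any
    Avro implementation): a naive base-128 varint reader accepts both the
    canonical bytes {0x02} and the redundant bytes {0x82, 0x00} and decodes
    them to the same value, so without a canonicality requirement the mapping
    from value to bytes is not unique.

        static long decodeVarint(byte[] bytes) {
            long value = 0;
            int shift = 0;
            for (byte b : bytes) {
                value |= (long) (b & 0x7F) << shift;  // take the low 7 bits
                shift += 7;
                if ((b & 0x80) == 0) break;           // no continuation bit: done
            }
            return value;  // still zig-zag encoded; un-zig-zag to get the number
        }

        // decodeVarint(new byte[]{0x02})             == 2  (zig-zag for 1)
        // decodeVarint(new byte[]{(byte)0x82, 0x00}) == 2  (same value, longer form)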

    * [Bug] [Line 446]. "bytes of UTF-8". As far as I understand, UTF-8 is a
    sequence of encoded scalar values (not to be confused with code points).
    Unfortunately, not everybody knows this, and thus we see WTF-8 (i.e. an
    encoding similar to UTF-8, but with standalone surrogates) in places where
    proper UTF-8 should reside. So everywhere the spec says "UTF-8" I propose
    explicitly stating that standalone surrogates are not allowed and that
    readers should fail if they find them (I would prefer to place this
    sentence in the introduction of the spec).
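
    As a concrete illustration (plain Java, nothing Avro-specific): the bytes
    ED A0 80 are the WTF-8 encoding of the lone surrogate U+D800, and a strict
    UTF-8 decoder rejects them:

        java.nio.charset.CharsetDecoder strict =
            java.nio.charset.StandardCharsets.UTF_8.newDecoder();  // reports malformed input by default
        strict.decode(java.nio.ByteBuffer.wrap(
            new byte[]{(byte) 0xED, (byte) 0xA0, (byte) 0x80}));   // throws MalformedInputException (checked)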

    * [Bug] [Line 572]. "Currently for C/C++ implementations, the positions are
    practically an int, but theoretically a long". Wat? So, other
    implementations use int (as per the spec), but C++ uses long, right? So, go
    fix the C++ implementation to match the spec and the other implementations.

    * [Opinion] [Line 596]. "if its type is null, then it is encoded as a JSON
    null". There is no reason to special-case nulls. This is an additional
    requirement which adds complexity to implementations for no reason.

    * [*Real* bug] [Line 598]. "otherwise it is encoded as a JSON object..." It
    seems I found a real bug. :) Consider this schema:

    
    [{"type":"record","name":"map","fields":[{"name":"b","type":"int"}]},{"type":"map","values":"int"}]

    As far as I understand, such a schema is fully allowed. Now consider this
    encoded value: {"map":{"b":0}}. What is it? A map, or the record named
    "map"?

    * [Bug] [Line 677]. "data is ordered by ascending numeric value". What
    about NaNs?

    * [Bug] [Line 682]. "compared lexicographically by Unicode code point".
    Replace this with scalar values; UTF-8 consists of scalar values.

    * [Opinion] [Line 737]. "The 16-byte, randomly-generated sync marker for
    this file". I don't like this. It implies that container files holding the
    same data are usually not bit-for-bit equal, so it is not possible to
    compare them bitwise to determine equality. So, in my Avro implementation I
    write null bytes instead of this marker (yes, this possibly means that my
    implementation is non-conforming).

    * [Opinion] [Line 717]. There is no marker for the end of a container file.
    Thus there is no way to determine whether all the data was written.

    * [Bug] [Line 1186]. "string is promotable to bytes. bytes is promotable to
    string". Wat? How are these values promoted?

    * [Bug] [Line 1153]. What should an implementation do (when it does schema
    resolution)? Should it first check that the schemas match (and report any
    errors) and only then read the data? Or proceed straight to reading the
    data? This is an important distinction. For example, what happens when we
    attempt to read a file container without data elements using the schema
    resolution algorithm? (Are such containers allowed, by the way?) In the
    first case a schema check should be performed; in the second case such a
    read should always succeed.

    If you think the first case is correct, then the section should describe
    the algorithm for determining whether schemas match separately from the
    algorithm for actually reading the data.

    * [Bug] [Line 1308]. <<int instead of {"type":"int"}>>. You mean <<"int"
    instead of {"type":"int"}>>, right?

    * [Bug] [Line 1331]. "replace any escaped characters". Any? What about
    "a\"b"? It is impossible to replace :)

    ----

    Some notes about my task. I want to implement this:
    https://blogs.gnome.org/alexl/2012/08/10/rethinking-the-shell-pipeline/ .
    I.e. I want to have shell utilities on Linux which exchange structured
    data. Here is how I chose the format for representing that data.

    * I want the format to be binary, not textual; this rules out JSON, XML, etc.
    * I want the format to be typed; this rules out CBOR etc.
    * I want the format to support proper sum types (similar to Haskell's);
    this rules out Microsoft Bond. As far as I understand, this also rules out
    GVariants, proposed in the above-mentioned article. And this rules out
    Protobuf: Protobuf has support for sum types (they are called OneOf), but
    these OneOfs are always optional (speaking in Avro language: you always get
    ["null", "int", "bool"] instead of ["int", "bool"]).
    * I want the format to support recursive types; this rules out Bare
    ( baremessages.org ).

    So, not many formats are left: Avro and possibly a few more. And I chose
    Avro. And I really like it. Because:

    * It is very compact
    * It has a very elegant way to support schema evolution (as opposed to
    Protobuf, where fields are tagged, i.e. you trade space efficiency for
    schema evolution)
    * It has a container format with the schema attached
    * You don't need to write an item count in the container header (good for
    streaming)

    So, Avro is simply the *best* format for my task. But then I discovered its
    problems (listed above). How did it happen that such a good format has such
    a bad spec? How did it happen that the *best* format for this task turned
    out to be so bad? What does this say about our industry?

    ==
    Askar Safin
    http://safinaskar.com
    https://sr.ht/~safinaskar
    https://github.com/safinaskar
