Hi. I'm writing my own Avro implementation in Rust for personal use. During this work I found a lot of issues in the Avro spec (the list follows).

I'm sending this mail not only to the user and dev mailing lists of Avro, but also to the Apache community list, the Kafka list and 3 semi-randomly chosen Materialize employees, because I want to draw attention to these problems. I hope this wider community will help Avro fix these issues and will possibly provide the necessary resources.

As far as I understand, Avro is used in Kafka. And Kafka, according to their site, is used in "80% of all Fortune 100 companies". So Avro is a critical piece of humanity's infrastructure. It should be absolutely perfect (which is why I list even very small bugs). But it is not perfect.

Some of the items in this list are bugs (small and big), some are typos, and some are my objections to the design. Some are fixable while keeping compatibility, some are not. I don't want to spend my time reporting them as separate bugs, but you can try to convince me to do so.

I created this list simply by reading the spec from end to end (I skipped the sections on RPC and logical types). And I didn't even look at the implementations!

I am writing this in the hope of helping Avro.

I think a big audit of the spec and its implementations should be done.

All line numbers refer to spec.xml from the tag release-1.11.0 (i.e. https://github.com/apache/avro/blob/release-1.11.0/doc/src/content/xdocs/spec.xml ). As far as I understand, this tag corresponds to the version currently published at https://avro.apache.org/docs/current/spec.html .

So, here we go.

* [Opinion] [No line]. In Avro one has to define named records inside each other, like so:

{ "type": "record", "name": "a", "fields": 
[{"name":"b","type":{"type":"record","name":"c",...}}] }

This is very unnatural. In popular programming languages one usually defines named records next to each other, not one inside the other. Such a representation is not convenient to deal with programmatically. In my implementation I have to convert it to the usual form "root type + list of named types" right after reading the JSON and convert it back just before writing (see the sketch below).
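
To make this concrete, here is a minimal sketch of the internal representation I mean; the type and field names are mine, not from any existing library:

// "Root type + list of named types": all named types are defined side by side,
// and the rest of the schema refers to them by name.
struct SchemaFile {
    named_types: Vec<NamedType>, // every record, defined at the top level
    root: TypeRef,               // the top-level type
}

struct NamedType {
    name: String,
    fields: Vec<(String, TypeRef)>, // simplified: records only
}

enum TypeRef {
    Int,
    Str,
    Named(String), // a reference to one of the entries in named_types
}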

* [Opinion] [No line]. In this list you will see a lot of questions about Avro schemas (which are encoded as JSON). A good JSON Schema ( https://json-schema.org/ ) for them would resolve many of these questions.

* [Seems to be bug] [Line 49]. "derived type name" is a vague term. In fact, in the whole spec the phrase "type name" is used very vaguely. Sometimes it means strings like "record" and sometimes it means the names of named types. I propose defining, at the very beginning of the spec, separate terms for primitive types, for strings like "record", and for the names of defined types. Here is one possible way to do this: call strings like "record" and "fixed" "type kinds" and never call them type names, thus reserving the term "type name" for named types only (and possibly for primitive types).

This issue has already caused problems: look, for example, at this problem with {"type":"record","name":"record",...}: https://lists.apache.org/thread/0wmgyx6z69gy07lvj9ndko75752b8cn2 .

* [Opinion] [Line 58]. There is no primitive type for unsigned 64-bit integers. Such a type is present in languages such as C and Rust.

* [Very theoretical bug, possibly even security-related] [Line 435]. "The float is converted into a 32-bit integer using a method equivalent to Java's floatToIntBits and then encoded in little-endian format". If we follow the provided link, we will see that this Java function performs NaN normalization. I think NaN normalization is a good thing. But I think it is quite possible that implementers of the spec overlooked this NaN normalization requirement. So I propose: state explicitly in the Avro spec that NaNs are normalized; audit all Avro implementations to see whether they actually implement this requirement; and create tests which actually exercise it.

Also, I don't know whether the bit pattern given in that Java doc (0x7fc00000) is a quiet NaN or a signaling one. If it is signaling, that is very bad.

As far as I understand, if you configure your FPU in a particular way, then merely copying a signaling NaN from one place to another will abort your program. So, if your FPU is configured a certain way, feeding particular binary Avro data to a program can crash it, i.e. this is a security problem. So a reader should be careful to check whether the input data is a signaling NaN *before* storing it in floating-point registers.

I checked whether manipulating a signaling NaN can actually crash a program with default settings on Windows and Linux. It turned out that the program does not crash. Still, I think signaling NaNs should be handled carefully.

Write in the spec that writers should normalize NaNs, that readers should reject non-normalized NaNs, and that readers should be careful not to store an incoming floating-point number in a floating-point variable before sanitizing it. Write that this is a security issue (see the sketch below).
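
A minimal sketch of the behaviour I am asking for, assuming the canonical quiet-NaN pattern 0x7fc00000 from the Java documentation (the function names are mine):

fn write_float(x: f32) -> [u8; 4] {
    // Collapse every NaN to the canonical pattern, as Java's floatToIntBits does.
    let bits = if x.is_nan() { 0x7fc0_0000 } else { x.to_bits() };
    bits.to_le_bytes() // Avro encodes floats in little-endian order
}

fn read_float(buf: [u8; 4]) -> Result<f32, String> {
    let bits = u32::from_le_bytes(buf);
    // Inspect the raw bits *before* constructing an f32, so a signaling NaN
    // never reaches a floating-point register.
    let is_nan = (bits & 0x7f80_0000) == 0x7f80_0000 && (bits & 0x007f_ffff) != 0;
    if is_nan && bits != 0x7fc0_0000 {
        return Err("non-canonical NaN rejected".to_string());
    }
    Ok(f32::from_bits(bits))
}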

* [Opinion] [Line 68]. "unicode character sequence". As far as I understand, a Unicode character sequence means a sequence of Unicode scalar values. Note that a scalar value is not the same thing as a code point. Unfortunately, some people don't know this, so please write explicitly: "this is a sequence of scalar values, not code points", to make sure implementations will be correct.
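
To illustrate the distinction: in Rust a char is by definition a Unicode scalar value, so surrogate code points are excluded (my example, not from the spec):

fn main() {
    // U+0041 is a scalar value. U+D800 is a code point but not a scalar value,
    // because it is a surrogate; Rust's char type admits only scalar values.
    assert!(char::from_u32(0x0041).is_some());
    assert!(char::from_u32(0xD800).is_none());
}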

* [Bug] [Line 71]. "Primitive types have no specified attributes". This is a lie. At line 1527 you specify a logical type based on the primitive type int. Thus you specify a particular meaning of the attribute "logicalType" for the primitive type "int". Be careful with your wording. The spec should be rock-solid.

* [Opinion] [Line 96]. "aliases: a JSON array of strings, providing alternate names for this record (optional)". Is an empty array allowed? :) Are duplicate aliases allowed? :) Yes, you may say this is nitpicking, but I don't think so. Avro has an important place in our infrastructure, so everything is important. Think carefully about whether an empty list (and duplicates) is allowed everywhere in the spec where some kind of list appears. I think empty arrays (and duplicates) should be disallowed in this particular case, because the more things we allow, the bigger the attack surface.

* [Bug] [Line 98]. "fields: a JSON array, listing fields". How many fields are allowed? Already reported by me at https://issues.apache.org/jira/browse/AVRO-3279

* [Bug] [Line 235]. "Unions". How many variants are allowed in a union? Already reported by me at https://issues.apache.org/jira/browse/AVRO-3280

* [Bug] [Line 101]. "name: a JSON string providing the name of the field (required), and". The word "and" is usually placed immediately before the last item in a sequence. The text here looks as if the item "doc" used to be last, but then the text was edited carelessly. This is a very stupid typographic issue which shows that the authors are not careful about spec quality. Also, the spec is not consistent about punctuation after list items (this applies to the whole spec): sometimes there is nothing at the end of an item, sometimes "." and sometimes ";".

* [Opinion] [Line 106]. "default: A default value for..." What follows is essentially a description of the JSON representation of an Avro datum (except for unions). So, you managed to squeeze a very important part of your spec into a single paragraph in a second-level bullet point?!

* [Opinion] [Line 112]. "Default values for union fields correspond to the first schema in the union". This sentence introduces a difference between the JSON encoding of Avro data and the JSON encoding of a field's default value. And, of course, the existence of this difference is a design bug.

* [Opinion] [Line 113]. "Default values for bytes and fixed fields are JSON strings, where Unicode code points 0-255 are mapped to unsigned 8-bit byte values 0-255". Wat? This is a very unnatural encoding. You misuse JSON strings: they are for text, not for binary data. You should use an array of numbers instead, i.e. encode the bytes 0x0f 0x02 as [15, 2]. Moreover, how will you encode null bytes? "\u0000", right? C programs have difficulties with such strings.
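
Just to make the mapping the spec describes concrete, here is a sketch of how such a default string has to be decoded (the function name is mine):

// Each code point 0-255 of the (already JSON-unescaped) string becomes one
// byte; any code point above 255 makes the default invalid.
fn bytes_default_from_string(s: &str) -> Option<Vec<u8>> {
    s.chars().map(|c| u8::try_from(u32::from(c)).ok()).collect()
}
// For example, the JSON string "\u000f\u0002" decodes to vec![15, 2].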

* [Bug] [Line 123]. Okay, so longs are encoded as JSON integers. But what if a given long is not a JSON-safe integer? As we know from https://datatracker.ietf.org/doc/html/rfc7159 , integers outside of the range [-(2**53)+1, (2**53)-1] are not JSON-safe.
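
A concrete consequence, assuming the JSON consumer parses numbers into IEEE 754 doubles (as JavaScript and many JSON libraries do):

fn main() {
    // 2**53 + 1 is not representable as an f64, so a long outside the JSON-safe
    // range silently changes after a round trip through a double-based parser.
    let original: i64 = 9_007_199_254_740_993; // 2**53 + 1
    let through_double = original as f64 as i64;
    assert_eq!(through_double, 9_007_199_254_740_992); // off by one
}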

* [Bug] [Line 123]. Infinities and NaNs cannot be represented in JSON, even though they seem to be allowed by the Avro spec. So, the JSON representation of Avro data is incomplete.

* [Bug] [Line 128]. "enum". What is an enum? :) This term has not yet been defined at this point in the spec. For unknown reasons you decided to insert essentially the whole description of the JSON representation of Avro data into one small paragraph even before the type system is fully described. Please use terms only after they have been defined.

* [Stupid bug] [Line 168]. "namespace". The namespace attribute is not marked as optional or required. Do you ever read your own spec?

* [Even more stupid bug] [Line 168]. "<em>namespace</em>". The word "namespace" is marked up using <em>, not <code>, and we can see this in the rendered version. This is a very stupid typographic bug which is immediately obvious to anybody reading the document, even to non-technical people.

* [Bug] [Line 200]. "a single attribute". As we can see in the provided example, "default" is allowed, too. What is the meaning of this "default" attribute? And how does its meaning differ from the meaning of the "default" key in a field description? (Same for maps.)

* [Bug] [Line 238]. "declares a schema which may be either a null or string". A lie. The schema is ["null", "string"]. The *value* may be a null or a string. Please check that you don't confuse types (schemas) with values throughout the whole spec.

* [Very opinionated opinion :)] [Line 235]. I don't like your unions at all. I'm coming from languages like Haskell and Rust, where true sum types are supported. They are similar to your unions, but their alternatives are named. Alternatives are identified by their names, so there is no restriction on duplicate types, and therefore no need for the very unnatural restriction "Unions may not contain more than one schema with the same type, except for the named types record, fixed and enum" (see the sketch below).
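
For comparison, a sketch of named alternatives in Rust (my example): both variants carry the same payload type, which the quoted union rule would forbid.

// In a Rust sum type the alternatives are identified by name, so nothing stops
// two of them from carrying the same payload type. The closest Avro union,
// ["long", "long"], is forbidden by the spec.
enum Timestamp {
    CreatedAt(i64),
    UpdatedAt(i64),
}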

* [Absolutely stupid bug] [Line 261]. "aliases: a JSON array of strings, providing alternate names for this enum". You mean "fixed", right? So, you copy-pasted the section on enums? Have you ever read your spec from end to end even once?

* [Bug] [Line 265]. "size: an integer, specifying the number of bytes per 
value". Is zero allowed?

* [Bug] [Line 292]. "The null namespace may not be used in a dot-separated sequence of names". You previously defined the null namespace as an empty string standing in for the *whole* namespace, i.e. the null namespace is the absence of a namespace (the absence of the whole dot-separated sequence). So it makes no sense to speak about using the null namespace inside a dot-separated sequence. You probably mean that an empty string must not be used as a component of a dot-separated sequence.

* [Bug] [Line 374]. "Deserializing data into a newer schema is accomplished by specifying an additional schema, the results of which are described in Schema Resolution". The term "additional schema" is vague here. I would put it like this: "Deserializing data into a newer schema is accomplished by using the algorithm described in Schema Resolution".

* [Bug] [Line 380]. "Therefore, it is possible, though not advisable, to read Avro data with a schema that does not..." The whole paragraph is very vague. On first reading I thought it was about schema resolution. After several attempts to understand it I finally realized that the paragraph is about attempts to read data when the original writer schema is not available at all. I propose removing the whole paragraph or rewriting it completely.

* [Bug] [Line 385]. "For example, int and long are always serialized the same way". What does this mean? You probably mean that the *same* int and long (i.e. an int and a long which are numerically identical) are serialized the same way.

* [Bug] [Line 413]. "null is written as zero bytes". The phrase is vague. Do you mean no bytes at all? Or null bytes, i.e. some unspecified number of null bytes? (Of course, I understand that you mean the former, but I still don't like the phrase.)

* [Bug] [Line 417]. "int and long values are written..." Is the canonical (i.e. smallest) encoding of numbers required? Already reported by me at https://issues.apache.org/jira/browse/AVRO-3307

I still think canonical representations should be required. The more forms of encoding you allow, the bigger the attack surface.

Also, it would be a desirable property for binary representations to be equal whenever the data is equal. It would be good if you guaranteed this property at least for some subset of schemas (and, of course, stated explicitly for which schemas the property is guaranteed). Non-canonical representations break this property (see the sketch below).
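
A sketch of what I mean; the encoder below is mine, following the zigzag + variable-length scheme described in the spec:

// Canonical (smallest) encoding of an Avro long: zigzag, then a base-128
// varint, low 7 bits first, high bit set on every byte except the last.
fn encode_long_canonical(n: i64) -> Vec<u8> {
    let mut z: u64 = ((n as u64) << 1) ^ ((n >> 63) as u64); // zigzag
    let mut out = Vec::new();
    while z >= 0x80 {
        out.push((z & 0x7f) as u8 | 0x80);
        z >>= 7;
    }
    out.push(z as u8);
    out
}
// The value 0 canonically encodes as [0x00], but a lenient decoder would also
// accept the redundant form [0x80, 0x00] (an extra continuation byte with an
// empty payload), so equal data would no longer guarantee equal bytes.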

* [Bug] [Line 446]. "bytes of UTF-8". As far as I understand, UTF-8 is a sequence of encoded scalar values (not to be confused with code points). Unfortunately, not everybody knows this, and thus we see WTF-8 (i.e. an encoding similar to UTF-8, but allowing unpaired surrogates) in places where proper UTF-8 should reside. So everywhere the spec says "UTF-8" I propose explicitly stating that unpaired surrogates are not allowed and that readers should fail if they find them (I would prefer to place this sentence in the introduction of the spec).
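
For example, the WTF-8 encoding of the unpaired surrogate U+D800 is the byte sequence ED A0 80; a strict UTF-8 decoder, such as Rust's, rejects it (my illustration):

fn main() {
    // ED A0 80 is how WTF-8 (and CESU-8) encodes the unpaired surrogate U+D800.
    // It is not valid UTF-8, and readers should reject it rather than pass it on.
    let wtf8_surrogate = [0xED, 0xA0, 0x80];
    assert!(std::str::from_utf8(&wtf8_surrogate).is_err());
}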

* [Bug] [Line 572]. "Currently for C/C++ implementations, the positions are practically an int, but theoretically a long". Wat? So, other implementations use int (as per the spec), but the C/C++ implementations use long, right? Then go fix the C/C++ implementations to match the spec and the other implementations.

* [Opinion] [Line 596]. "if its type is null, then it is encoded as a JSON null". There is no reason to special-case nulls. This is an additional requirement which adds complexity to implementations for no reason.

* [*Real* bug] [Line 598]. "otherwise it is encoded as a JSON object..." It seems I found a real bug. :) Consider this schema:

[{"type":"record","name":"map","fields":[{"name":"b","type":"int"}]},{"type":"map","values":"int"}]

As far as I understand, such a schema is fully allowed. Now consider this encoded value: {"map":{"b":0}}. What is it? A map, or the record named "map"?

* [Bug] [Line 677]. "data is ordered by ascending numeric value". What about 
NaNs?

* [Bug] [Line 682]. "compared lexicographically by Unicode code point". Replace "code point" with "scalar value". UTF-8 consists of encoded scalar values.

* [Opinion] [Line 737]. "The 16-byte, randomly-generated sync marker for this file". I don't like this point. It implies that two container files holding the same data are usually not bitwise equal, so it is not possible to compare them bitwise to determine equality. So, in my Avro implementation I write null bytes instead of this marker (yes, this possibly means that my implementation is non-conforming).

* [Opinion] [Line 717]. There is no marker for the end of a container file. Thus there is no way to determine whether all data was written.

* [Bug] [Line 1186]. "string is promotable to bytes. bytes is promotable to string". Wat? How are these values promoted?

* [Bug] [Line 1153]. What should an implementation do when it performs schema resolution? Should it first check that the schemas match (and report any errors) and only then read the data? Or proceed straight to reading the data? This is an important distinction. For example, what happens when we attempt to read a file container with no data elements using the schema resolution algorithm? (Are such containers allowed, by the way?) In the first case a schema check should be performed; in the second case such a read would always succeed.

If you think the first case is correct, then the section should describe the algorithm for determining whether schemas match separately from the algorithm for actually reading the data.

* [Bug] [Line 1308]. <<int instead of {"type":"int"}>>. You mean <<"int" 
instead of {"type":"int"}>>, right?

* [Bug] [Line 1331]. "replace any escaped characters". Any? What about "a\"b"? That one is impossible to replace. :)

----

Some notes about my task. I want to implement this: https://blogs.gnome.org/alexl/2012/08/10/rethinking-the-shell-pipeline/ . I.e. I want to have shell utilities on Linux which exchange structured data. Here is how I chose the format for representing that data.

* I want the format to be binary, not textual; this rules out JSON, XML, etc.
* I want the format to be typed; this rules out CBOR etc.
* I want the format to support proper sum types (similar to Haskell's); this rules out Microsoft Bond. As far as I understand, this also rules out GVariant, proposed in the above-mentioned article. And it rules out Protobuf: Protobuf has support for sum types (they are called OneOf), but these OneOfs are always optional (speaking in Avro terms: you always get ["null", "int", "bool"] instead of ["int", "bool"])
* I want the format to support recursive types; this rules out BARE ( baremessages.org )

So, not many formats are left: Avro and possibly a few more. And I chose Avro. And I really like it. Because:

* It is very compact
* It has a very elegant way to support schema evolution (as opposed to Protobuf, where fields are tagged, i.e. you trade space efficiency for schema evolution)
* It has a container format with the schema attached
* You don't need to write an item count in the container header (good for streaming)

So, Avro is simply the *best* format for my task. But then I discovered its problems (listed above). How did it happen that such a good format has such a bad spec? How did it happen that the *best* format for this task turned out to be this flawed? What does this say about our industry?

==
Askar Safin
http://safinaskar.com
https://sr.ht/~safinaskar
https://github.com/safinaskar
