Scott created AVRO-3029:
---------------------------

             Summary: Specification is a little ambiguous about where enum 
defaults should be defined, which might be causing library differences
                 Key: AVRO-3029
                 URL: https://issues.apache.org/jira/browse/AVRO-3029
             Project: Apache Avro
          Issue Type: Improvement
          Components: java, python, ruby
    Affects Versions: 1.10.1
            Reporter: Scott


In the specification, an enum type can have a `default` attribute. At the same 
time, each field in a record can have a default. On top of that, the chart of 
example default values for fields includes an entry for enum, which suggests 
that a field-level default is also valid for enum types.

So, if I want to define a record with an enum field, where would I put the 
default? Do I define it like this:
{code:java}
{
    "type": "record",
    "name": "test",
    "fields": [
        {
            "name": "enum",
            "type": {
                "type": "enum",
                "name": "enum_field",
                "symbols": ["FOO", "BAR"]
            },
            "default": "FOO"
        }
    ]
}
{code}

Or like this:
{code:java}
{
    "type": "record",
    "name": "test",
    "fields": [
        {
            "name": "enum",
            "type": {
                "type": "enum",
                "name": "enum_field",
                "symbols": ["FOO", "BAR"],
                "default": "FOO"
            }
        }
    ]
}
{code}

I was confused, so I started looking for examples, and it seems I'm not the 
only one confused about this: [this Stack Overflow 
question|https://stackoverflow.com/questions/62596990/avro-schema-evolution-with-enum-deserialization-crashes]
 and one Jira ticket put the default at the field level, whereas another Jira 
ticket puts the default at the enum level.

So then I started looking at examples in the codebase. There are a [Ruby test 
case|https://github.com/apache/avro/blob/7d1e63b219e6d0778bc57195152477adee97fcab/lang/ruby/test/test_schema.rb#L333-L338]
 and a [Java test 
case|https://github.com/apache/avro/blob/7d1e63b219e6d0778bc57195152477adee97fcab/lang/java/avro/src/test/java/org/apache/avro/FooBarSpecificRecord.java#L34]
 that put the default at the enum level.

Okay, solved, right? Since the test cases have the default at the enum level, 
that's where it should be... but then I wrote a simple Python script (since 
I'm a Python user) to double-check this, and it seems like the Python library 
disagrees. Here's the example script, which puts the default at the enum 
level:
{code:java}
import json
from io import BytesIO
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

writer_schema = {
    "type": "record",
    "name": "test",
    "fields": [
        {
            "name": "foo",
            "type": "string"
        }
    ],
}

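# Reader schema adds an "enum" field whose default is declared at the enum level.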
reader_schema = {
    "type": "record",
    "name": "test",
    "fields": [
        {
            "name": "foo",
            "type": "string"
        },
        {
            "name": "enum",
            "type": {
                "type": "enum",
                "name": "enum_field",
                "symbols": ["FOO", "BAR"],
                "default": "FOO",
            },
        },
    ],
}

w_schema = avro.schema.parse(json.dumps(writer_schema))
r_schema = avro.schema.parse(json.dumps(reader_schema))

bio = BytesIO()

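# Write a record that has no "enum" field, then read it back with the reader
# schema so that schema resolution has to supply the enum default.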
writer = DataFileWriter(bio, DatumWriter(), w_schema)
writer.append({"foo": "bar"})
writer.flush()

bio.seek(0)

reader = DataFileReader(bio, DatumReader(w_schema, r_schema))
for record in reader:
    print(record)
{code}
But when I run that, I get an exception:
{code:java}
avro.io.SchemaResolutionException: No default value for field enum
Writer's Schema: {
  "type": "record",
  "name": "test",
  "fields": [
    {
      "type": "string",
      "name": "foo"
    }
  ]
}
Reader's Schema: {
  "type": "record",
  "name": "test",
  "fields": [
    {
      "type": "string",
      "name": "foo"
    },
    {
      "type": {
        "type": "enum",
        "default": "FOO",
        "name": "enum_field",
        "symbols": [
          "FOO",
          "BAR"
        ]
      },
      "name": "enum"
    }
  ]
}
{code}
And if I change the script to use a reader_schema that has the default at the 
field level, like this:
{code:java}
reader_schema = {
    "type": "record",
    "name": "test",
    "fields": [
        {
            "name": "foo",
            "type": "string"
        },
        {
            "name": "enum",
            "type": {
                "type": "enum",
                "name": "enum_field",
                "symbols": ["FOO", "BAR"],
            },
            "default": "FOO",
        },
    ],
}
{code}
Then it works and prints out the record with the default value for the enum:
{code:java}
{'foo': 'bar', 'enum': 'FOO'}
{code}

I don't have a Java environment set up to run the same kind of script in Java 
and verify that implementation, but based on the test case I would assume it 
behaves in exactly the opposite way and expects the default at the enum level.
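
For anyone with a Java environment handy, a rough equivalent of the Python script might look something like the sketch below. This is only a sketch of what I had in mind, not something I have run; the class name `EnumDefaultCheck` is just a placeholder, and it uses the generic binary encoder/decoder API instead of data files to keep it short. Whether it prints the record with "FOO" filled in or throws a resolution error would show which placement the Java implementation honors.
{code:java}
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class EnumDefaultCheck {
  public static void main(String[] args) throws Exception {
    // Writer schema: only the "foo" string field.
    Schema writerSchema = new Schema.Parser().parse(
        "{\"type\": \"record\", \"name\": \"test\", \"fields\": ["
        + "{\"name\": \"foo\", \"type\": \"string\"}]}");

    // Reader schema: adds the "enum" field with the default declared at the enum level.
    Schema readerSchema = new Schema.Parser().parse(
        "{\"type\": \"record\", \"name\": \"test\", \"fields\": ["
        + "{\"name\": \"foo\", \"type\": \"string\"},"
        + "{\"name\": \"enum\", \"type\": {\"type\": \"enum\", \"name\": \"enum_field\","
        + "\"symbols\": [\"FOO\", \"BAR\"], \"default\": \"FOO\"}}]}");

    // Serialize a record that has no "enum" field using the writer schema.
    GenericRecord record = new GenericData.Record(writerSchema);
    record.put("foo", "bar");
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(writerSchema).write(record, encoder);
    encoder.flush();

    // Deserialize with the reader schema; schema resolution must supply the enum default.
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    GenericRecord result =
        new GenericDatumReader<GenericRecord>(writerSchema, readerSchema).read(null, decoder);
    System.out.println(result);
  }
}
{code}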

I think making the libraries consistent could cause massive breakage for 
whichever library doesn't currently conform to whatever the specification is 
supposed to say (and, honestly, I'm not sure what that is based on how the 
spec is currently written). Therefore, I think it might be easiest to allow an 
enum's default to be defined at either the field level or the enum level. I 
maintain the `fastavro` library, where the behavior is the same as the Avro 
Python implementation, and I would hate to force a massive breaking change 
like this on users if the specification is updated to say that enum default 
values have to be defined at the enum level rather than the field level.

Please let me know your thoughts and thank you for taking the time to read this 
lengthy message.
