[
https://issues.apache.org/jira/browse/AVRO-3313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Valentin updated AVRO-3313:
---------------------------
Description:
I wanted to use Avro enums and evolve my schema over time by adding new
symbols.
The documentation says:
{code:java}
default: A default value for this enumeration, used during resolution when the
reader encounters a symbol from the writer that isn't defined in the reader's
schema (optional). The value provided here must be a JSON string that's a
member of the symbols array. See documentation on schema resolution for how
this gets used. {code}
And the section of the documentation about schema resolution says :
[https://avro.apache.org/docs/current/spec.html#Schema+Resolution]
{code:java}
if both are enums:
if the writer's symbol is not present in the reader's enum and the reader has a
default value, then that value is used, otherwise an error is signalled. {code}
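As I read it, the rule quoted above amounts to the small decision procedure sketched below. This is plain Java written only to illustrate the quoted sentence. It is not Avro's implementation, and the names `EnumResolutionSketch` and `resolve` are hypothetical.

```java
import java.util.List;

/**
 * Illustrative model of the enum resolution rule quoted from the spec.
 * Hypothetical helper, not part of the Avro library.
 */
final class EnumResolutionSketch {

    /**
     * @param writerSymbol  symbol found in the written data
     * @param readerSymbols symbols declared in the reader's enum
     * @param readerDefault default symbol of the reader's enum, or null if none is declared
     * @return the symbol the reader should see
     */
    static String resolve(String writerSymbol, List<String> readerSymbols, String readerDefault) {
        // The reader knows the symbol: use it unchanged.
        if (readerSymbols.contains(writerSymbol)) {
            return writerSymbol;
        }
        // Unknown symbol, but the reader's enum declares a default: fall back to it.
        if (readerDefault != null) {
            return readerDefault;
        }
        // Unknown symbol and no default: an error is signalled.
        throw new IllegalArgumentException("No match for " + writerSymbol);
    }
}
```

Under this reading, a call such as `EnumResolutionSketch.resolve("C", Arrays.asList("A", "B"), "A")` would fall back to the default, which is the behavior I expected from the deserializer.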
This feature was introduced in Avro 1.9.0 with this issue:
[https://avro.apache.org/docs/current/spec.html#Enums]
*However I have found that it doesn't work at all like the specification says.*
Here is an example.
Suppose version 1 of the schema used for writing has two symbols (A and B)
and specifies symbol A as the default.
{code:java}
{
  "type": "record",
  "name": "RecordA",
  "fields": [
    {
      "name": "fieldA",
      "type": {
        "type": "enum",
        "name": "Enum1",
        "symbols": ["A", "B"]
      },
      "default": "A"
    }
  ]
} {code}
Later, when the schema needs to evolve on the writer side, we add a new symbol
(C) and publish the schema as version 2.
The default value is still A.
{code:java}
{
  "type": "record",
  "name": "RecordA",
  "fields": [
    {
      "name": "fieldA",
      "type": {
        "type": "enum",
        "name": "Enum1",
        "symbols": ["A", "B", "C"]
      },
      "default": "A"
    }
  ]
} {code}
According to the documentation, a reader using the old schema (version 1)
should be able to deserialize a payload containing the enum value C that was
written with the schema in version 2. Since the value C is unknown to the
reader, it should be deserialized as A.
Again, as the documentation says:
{code:java}
A default value for this enumeration, used during resolution when the reader
encounters a symbol from the writer that isn't defined in the reader's schema
{code}
The issue here is that either the documentation is wrong or the Avro
deserialization code is wrong. Since this was an intended feature, I assume
that this is a bug and the code is wrong.
I have forked the repository and created a test to demonstrate the issue :
[https://github.com/idkw/avro/commit/7d36203c137aa6a728d5b85b87969a3f743b45ee]
The test verifies that a reader using the old schema deserializes the value A
when it receives the value C. However, it fails with the exception
`org.apache.avro.AvroTypeException: No match for C`
{code:java}
@Test
public void enumRecordWithExtendedSchemaCanBeReadIfNewValuesAreUsedUsingDefault() throws Exception {
  Schema readerSchemaV1 = ENUM_AB_RECORD_DEFAULT_A;
  Schema writerSchemaV2 = ENUM_ABC_RECORD_DEFAULT_A;
  Record record = defaultRecordWithSchema(
      writerSchemaV2,
      FIELD_A,
      new EnumSymbol(writerSchemaV2, "C")
  );
  byte[] encoded = encodeGenericBlob(record);
  Record decodedRecord = decodeGenericBlob(
      readerSchemaV1,
      writerSchemaV2,
      encoded
  );
  Assert.assertEquals("A", decodedRecord.get(FIELD_A).toString());
} {code}
It should not fail but deserialize to "A".
> enum default value to allow deserializer to deserialize to when encountering
> new enum symbols doesn't work
> ----------------------------------------------------------------------------------------------------------
>
> Key: AVRO-3313
> URL: https://issues.apache.org/jira/browse/AVRO-3313
> Project: Apache Avro
> Issue Type: Bug
> Components: java
> Affects Versions: 1.9.0, 1.10.0, 1.9.1, 1.9.2, 1.11.0, 1.10.1, 1.10.2
> Reporter: Valentin
> Priority: Major
> Attachments: image-2022-01-19-14-34-52-879.png,
> image-2022-01-19-15-04-35-442.png
>