Cedric Holzer created AVRO-4238:
-----------------------------------
Summary: FastReader fails to unbox nested type when defaulting a
union<array<>> field
Key: AVRO-4238
URL: https://issues.apache.org/jira/browse/AVRO-4238
Project: Apache Avro
Issue Type: Bug
Components: java
Affects Versions: 1.12.1
Environment: Java 21, Gradle 8.11, org.apache.avro:avro:1.12.1
Reporter: Cedric Holzer
h2. Description
When FastReader is used, schema evolution fails with an
AvroRuntimeException if the reader schema adds a new field of type
union<array<T>, null> whose default value is an empty array.
FastReader is enabled by default in 1.12.1. Older versions are affected if
FastReader was enabled manually.
h3. Cause
In
[FastReaderBuilder.getDefaultingStep()|https://github.com/apache/avro/blob/4e376735ebbd14cc17e53116183039c8c4ced8ab/lang/java/avro/src/main/java/org/apache/avro/io/FastReaderBuilder.java#L191],
the fast path for empty list defaults calls
{code:java}
data.newArray(old, 0, field.schema()) {code}
field.schema() returns the union schema of the field.
[GenericData.newArray()|https://github.com/apache/avro/blob/4e376735ebbd14cc17e53116183039c8c4ced8ab/lang/java/avro/src/main/java/org/apache/avro/generic/GenericData.java#L1527]
internally calls schema.getElementType(), which is only valid for array
schemas and therefore throws the AvroRuntimeException ("Not an array") shown
under Actual Behaviour.
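The failure mode can be illustrated without Avro on the classpath. The following is a minimal sketch that uses a toy stand-in for org.apache.avro.Schema; the names UnionUnboxSketch, ToySchema and arrayBranch are hypothetical and exist only for this illustration. It shows that asking a union for its element type fails, while selecting the array branch first does not.

```java
import java.util.List;

// Toy stand-in for org.apache.avro.Schema (an assumption for illustration);
// it models only the type tag, the array element type, and the union branches.
class UnionUnboxSketch {
    enum Type { ARRAY, NULL, RECORD, UNION }

    static final class ToySchema {
        final Type type;
        final ToySchema elementType;  // non-null only for ARRAY
        final List<ToySchema> types;  // non-empty only for UNION

        private ToySchema(Type type, ToySchema elementType, List<ToySchema> types) {
            this.type = type;
            this.elementType = elementType;
            this.types = types;
        }

        static ToySchema of(Type type) {
            return new ToySchema(type, null, List.of());
        }

        static ToySchema arrayOf(ToySchema element) {
            return new ToySchema(Type.ARRAY, element, List.of());
        }

        static ToySchema unionOf(ToySchema... branches) {
            return new ToySchema(Type.UNION, null, List.of(branches));
        }

        // Like the accessor described above, this is only valid on array schemas.
        ToySchema getElementType() {
            if (type != Type.ARRAY) {
                throw new IllegalStateException("Not an array: " + type);
            }
            return elementType;
        }
    }

    // Returns the first ARRAY branch of a union, or the schema itself otherwise.
    static ToySchema arrayBranch(ToySchema schema) {
        if (schema.type != Type.UNION) {
            return schema;
        }
        return schema.types.stream()
                .filter(s -> s.type == Type.ARRAY)
                .findFirst()
                .orElse(schema);
    }

    public static void main(String[] args) {
        // Models union<array<EmptyRecord>, null>
        ToySchema field = ToySchema.unionOf(
                ToySchema.arrayOf(ToySchema.of(Type.RECORD)),
                ToySchema.of(Type.NULL));
        try {
            field.getElementType(); // the union itself has no element type
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
        // Selecting the array branch first makes the call valid.
        System.out.println(arrayBranch(field).getElementType().type);
    }
}
```

The toy model deliberately ignores everything else a real schema carries (names, defaults, logical types); it only mirrors the union-versus-array distinction that matters here.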
h2. Steps to reproduce
Run the following minimal reproducible example:
{code:java}
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.Encoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.specific.SpecificData;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import static java.util.Collections.emptyList;

public class AvroExample {
    final static Schema EMPTY_RECORD = SchemaBuilder
            .record("EmptyRecord")
            .fields()
            .endRecord();

    // adds union<array<EmptyRecord>, null> someField = []
    final static Schema READ_SCHEMA = SchemaBuilder.record("EvolvedRecord")
            .fields()
            .name("someField")
            .type()
            .unionOf()
            .array()
            .items(EMPTY_RECORD)
            .and()
            .nullType()
            .endUnion()
            .arrayDefault(emptyList())
            .endRecord();

    public static void main(String... args) throws IOException {
        // Disable fast reader -> works as specified, enable -> throws exception
        GenericData model = SpecificData.get().setFastReaderEnabled(true);

        // Serialize the empty record with the empty writer schema
        final Schema writeSchema = EMPTY_RECORD;
        final byte[] serialized;
        try (ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
            Encoder encoder = EncoderFactory.get().binaryEncoder(baos, null);
            DatumWriter<GenericData.Record> w = new GenericDatumWriter<>(EMPTY_RECORD, model);
            GenericData.Record emptyRecord = new GenericData.Record(writeSchema);
            w.write(emptyRecord, encoder);
            encoder.flush();
            serialized = baos.toByteArray();
        }

        // Deserialize with READ_SCHEMA; Avro should create the new field with its default value
        try (ByteArrayInputStream bais = new ByteArrayInputStream(serialized)) {
            Decoder decoder = DecoderFactory.get().directBinaryDecoder(bais, null);
            DatumReader<GenericData.Record> r = new GenericDatumReader<>(writeSchema, READ_SCHEMA, model);
            final Object deserialized = r.read(null, decoder);
            System.out.println(deserialized);
        }
    }
}{code}
h3. Expected Behaviour
Deserialization succeeds, the new field someField is populated with its
default value, and the program prints \{"someField": []}. This is the observed
behavior with .setFastReaderEnabled(false).
h3. Actual Behaviour
{noformat}
Exception in thread "main" org.apache.avro.AvroRuntimeException: Not an array: [{"type":"array","items":{"type":"record","name":"EmptyRecord","fields":[]}},"null"]
    at org.apache.avro.Schema.getElementType(Schema.java:374)
    at org.apache.avro.generic.GenericData.newArray(GenericData.java:1528)
    at org.apache.avro.io.FastReaderBuilder.lambda$getDefaultingStep$5(FastReaderBuilder.java:199)
    at org.apache.avro.io.FastReaderBuilder.lambda$createFieldSetter$1(FastReaderBuilder.java:181)
    at org.apache.avro.io.FastReaderBuilder$RecordReader.read(FastReaderBuilder.java:575)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:150)
    at AvroExample.main(AvroExample.java:61){noformat}
h2. Suggested Change
In
[FastReaderBuilder.getDefaultingStep()|https://github.com/apache/avro/blob/4e376735ebbd14cc17e53116183039c8c4ced8ab/lang/java/avro/src/main/java/org/apache/avro/io/FastReaderBuilder.java#L198],
when the default value is an empty list, the code could check whether the field
type is a union and, if so, use the first branch of type array:
{code:java}
// Current (broken):
(old, d) -> data.newArray(old, 0, field.schema())
// Fix: unwrap the union to find the array branch.
// arraySchema is assigned once so that the lambda can capture it.
final Schema fieldSchema = field.schema();
final Schema arraySchema = fieldSchema.getType() == Schema.Type.UNION
    ? fieldSchema.getTypes().stream()
        .filter(s -> s.getType() == Schema.Type.ARRAY)
        .findFirst()
        .orElse(fieldSchema)
    : fieldSchema;
(old, d) -> data.newArray(old, 0, arraySchema){code}
h2. Workaround
Disable FastReader by setting
{code:java}
-Dorg.apache.avro.fastread=false{code}
or change your type to be
{code:java}
union<null, array<T>> = null{code}
which works correctly but changes the default value to null instead of `[]`.
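If passing a JVM flag is inconvenient, the same system property can in principle be set from code. The following is a minimal sketch, assuming the property is read when Avro's data model is first initialised, so it has to run before any Avro class is touched; the class and method names are hypothetical. Calling .setFastReaderEnabled(false) on the model, as mentioned under Expected Behaviour, is an alternative that does not depend on initialisation order.

```java
// Hypothetical helper; only the property name is taken from the workaround above.
class FastReaderWorkaround {
    static final String FAST_READER_PROP = "org.apache.avro.fastread";

    // Assumption: must be called before Avro's data model classes are loaded,
    // because the property is expected to be read once at initialisation.
    static void disableFastReader() {
        System.setProperty(FAST_READER_PROP, "false");
    }

    public static void main(String[] args) {
        disableFastReader();
        System.out.println(FAST_READER_PROP + "=" + System.getProperty(FAST_READER_PROP));
    }
}
```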