Cedric Holzer created AVRO-4238:
-----------------------------------

             Summary: FastReader fails to unbox nested type when defaulting a 
union<array<>> field
                 Key: AVRO-4238
                 URL: https://issues.apache.org/jira/browse/AVRO-4238
             Project: Apache Avro
          Issue Type: Bug
          Components: java
    Affects Versions: 1.12.1
         Environment: Java 21, Gradle 8.11, org.apache.avro:avro:1.12.1
            Reporter: Cedric Holzer


h2. Description

With FastReader enabled, schema resolution fails with an 
AvroRuntimeException when the reader schema adds a new field of type 
union<array<T>, null> whose default value is an empty array.

FastReader is enabled by default in 1.12.1; older versions are affected if 
FastReader was enabled manually.
h3. Cause

In 
[FastReaderBuilder.getDefaultingStep()|https://github.com/apache/avro/blob/4e376735ebbd14cc17e53116183039c8c4ced8ab/lang/java/avro/src/main/java/org/apache/avro/io/FastReaderBuilder.java#L191],
 the fast path for empty lists calls 
{code:java}
data.newArray(old, 0, field.schema()) {code}
Here field.schema() returns the field's union schema, not its array branch. 
[GenericData.newArray()|https://github.com/apache/avro/blob/4e376735ebbd14cc17e53116183039c8c4ced8ab/lang/java/avro/src/main/java/org/apache/avro/generic/GenericData.java#L1527]
 internally calls schema.getElementType(), which is only valid for array 
schemas and therefore throws the AvroRuntimeException.
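
The failure can be isolated from FastReader entirely. The following is a minimal sketch, not code from Avro itself: the class name UnionElementTypeSketch and its helper methods are made up for illustration, and string is used as the element type only to keep the schema short. It calls getElementType() once on a union<array<string>, null> schema and once on the array branch of that union.

```java
import java.util.Arrays;

import org.apache.avro.AvroRuntimeException;
import org.apache.avro.Schema;

public class UnionElementTypeSketch {

    // Builds union<array<string>, null>, the same shape as the failing field's schema.
    static Schema arrayOrNull() {
        Schema array = Schema.createArray(Schema.create(Schema.Type.STRING));
        return Schema.createUnion(Arrays.asList(array, Schema.create(Schema.Type.NULL)));
    }

    // Returns true if getElementType() refuses the given schema.
    static boolean rejectsElementType(Schema schema) {
        try {
            schema.getElementType();
            return false;
        } catch (AvroRuntimeException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Schema union = arrayOrNull();
        Schema arrayBranch = union.getTypes().get(0);
        System.out.println("union rejected: " + rejectsElementType(union));
        System.out.println("array branch rejected: " + rejectsElementType(arrayBranch));
    }
}
```

Passing the array branch instead of the union is what the defaulting step would need to do for this field shape.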
h2. Steps to reproduce

Run the following minimal reproducible example:
{code:java}
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.Encoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.specific.SpecificData;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import static java.util.Collections.emptyList;

public class AvroExample {

    final static Schema EMPTY_RECORD = SchemaBuilder
            .record("EmptyRecord")
            .fields()
            .endRecord();

    // adds union<array<EmptyRecord>, null> someField = []
    final static Schema READ_SCHEMA = SchemaBuilder.record("EvolvedRecord")
            .fields()
            .name("someField")
            .type()
            .unionOf()
            .array()
            .items(EMPTY_RECORD)
            .and()
            .nullType()
            .endUnion()
            .arrayDefault(emptyList())
            .endRecord();

    public static void main(String... args) throws IOException {
        // Disable fast reader -> works as specified, enable -> throws exception
        GenericData model = SpecificData.get().setFastReaderEnabled(true);

        // Serialize the empty record with the empty writer Schema
        final Schema writeSchema = EMPTY_RECORD;
        final byte[] serialized;
        try (ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
            Encoder encoder = EncoderFactory.get().binaryEncoder(baos, null);
            DatumWriter<GenericData.Record> w = new GenericDatumWriter<>(EMPTY_RECORD, model);
            GenericData.Record emptyRecord = new GenericData.Record(writeSchema);
            w.write(emptyRecord, encoder);
            encoder.flush();
            serialized = baos.toByteArray();
        }

        // Deserialize with READ_SCHEMA; Avro should create the new field
        // with its default value
        try (ByteArrayInputStream bais = new ByteArrayInputStream(serialized)) {
            Decoder decoder = DecoderFactory.get().directBinaryDecoder(bais, null);
            DatumReader<GenericData.Record> r = new GenericDatumReader<>(writeSchema, READ_SCHEMA, model);
            final Object deserialized = r.read(null, decoder);
            System.out.println(deserialized);
        }
    }
}{code}
h3. Expected Behaviour

Deserialization succeeds, the new field someField is populated with its 
default value, and the program prints \{"someField": []}. This is the 
behavior observed with .setFastReaderEnabled(false).
h3. Actual Behaviour
{noformat}
Exception in thread "main" org.apache.avro.AvroRuntimeException: Not an array: [{"type":"array","items":{"type":"record","name":"EmptyRecord","fields":[]}},"null"]
    at org.apache.avro.Schema.getElementType(Schema.java:374)
    at org.apache.avro.generic.GenericData.newArray(GenericData.java:1528)
    at org.apache.avro.io.FastReaderBuilder.lambda$getDefaultingStep$5(FastReaderBuilder.java:199)
    at org.apache.avro.io.FastReaderBuilder.lambda$createFieldSetter$1(FastReaderBuilder.java:181)
    at org.apache.avro.io.FastReaderBuilder$RecordReader.read(FastReaderBuilder.java:575)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:150)
    at AvroExample.main(AvroExample.java:61){noformat}
h2. Suggested Change

In 
[FastReaderBuilder.getDefaultingStep()|https://github.com/apache/avro/blob/4e376735ebbd14cc17e53116183039c8c4ced8ab/lang/java/avro/src/main/java/org/apache/avro/io/FastReaderBuilder.java#L198],
 when the default value is an empty list, the code could check whether the 
field's type is a union and, if so, unwrap the first branch of type array:
{code:java}
// Current (broken):
(old, d) -> data.newArray(old, 0, field.schema())

// Fix: unwrap the union to find the array branch.
Schema fieldSchema = field.schema();
// Assigned exactly once so that the lambda below can capture it
// (captured locals must be effectively final).
final Schema arraySchema = fieldSchema.getType() == Schema.Type.UNION
    ? fieldSchema.getTypes().stream()
        .filter(s -> s.getType() == Schema.Type.ARRAY)
        .findFirst()
        .orElse(fieldSchema)
    : fieldSchema;
(old, d) -> data.newArray(old, 0, arraySchema){code}
h2. Workaround

Disable FastReader by setting 
{code:java}
-Dorg.apache.avro.fastread=false{code}
or change your type to be 
{code:java}
union<null, array<T>> = null{code}
which works correctly but changes the default value to null instead of `[]`.
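
Both workarounds can also be expressed in code. The sketch below is an illustration under stated assumptions rather than a recommended fix: the class name WorkaroundSketch and its method names are hypothetical, and the record and field names are borrowed from the reproduction example. The first method builds the null-first variant of the reader schema with SchemaBuilder; the second creates a dedicated GenericData model with FastReader switched off, which avoids changing the shared singleton.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;

public class WorkaroundSketch {

    static final Schema EMPTY_RECORD = SchemaBuilder
            .record("EmptyRecord")
            .fields()
            .endRecord();

    // union<null, array<EmptyRecord>> someField = null
    static Schema nullFirstReadSchema() {
        return SchemaBuilder.record("EvolvedRecord")
                .fields()
                .name("someField")
                .type()
                .unionOf()
                .nullType()
                .and()
                .array()
                .items(EMPTY_RECORD)
                .endUnion()
                .nullDefault()
                .endRecord();
    }

    // A separate data model with FastReader disabled; pass it to the
    // GenericDatumReader constructor instead of the shared instance.
    static GenericData modelWithoutFastReader() {
        return new GenericData().setFastReaderEnabled(false);
    }
}
```

With the null-first schema, readers see null rather than an empty list for records written before the field existed, so calling code has to handle null.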

 



