Hi,

Please see the code below, which reproduces the scenario:

import java.util.Arrays;
import java.util.Collections;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

Schema schema = new Schema.Parser().parse("""
  {
    "type": "record",
    "name": "person",
    "fields": [
      {
        "name": "address",
        "type": [
          "null",
          {
            "type": "array",
            "items": "string"
          }
        ],
        "default": null
      }
    ]
  }
"""
);

ParquetWriter<GenericRecord> writer =
    AvroParquetWriter.<GenericRecord>builder(new Path("/tmp/person.parquet"))
        .withSchema(schema)
        .build();

try {
  writer.write(new GenericRecordBuilder(schema)
      .set("address", Arrays.asList("first", null, "last"))
      .build());
} catch (Exception e) {
  e.printStackTrace();
}

try {
  writer.write(new GenericRecordBuilder(schema)
      .set("address", Collections.singletonList("first"))
      .build());
} catch (Exception e) {
  e.printStackTrace();
}


The first call to AvroParquetWriter#write attempts to write a record whose
array contains a null element and fails - as expected - with
"java.lang.NullPointerException: Array contains a null element at 1".
However, from that point on every subsequent call to AvroParquetWriter#write
fails as well, even for valid records, with
"org.apache.parquet.io.InvalidRecordException: 1(r) > 0 ( schema r)",
apparently because the state inside the RecordConsumer is not reset between
writes.

Is this the intended behavior of the writer? And if so, does one have to
create a new writer whenever a write fails?
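
For now I am working around this by validating each record against the schema
before handing it to the writer, roughly as sketched below. This uses Avro's
GenericData#validate, which does catch the null array element in this example,
but I have not verified that it covers every case that can leave the writer in
this state:

import org.apache.avro.generic.GenericData;

// Sketch of a client-side workaround: validate the record first so the
// writer never sees a datum that makes it throw halfway through a write.
GenericRecord record = new GenericRecordBuilder(schema)
    .set("address", Arrays.asList("first", null, "last"))
    .build();

if (GenericData.get().validate(schema, record)) {
  writer.write(record);
} else {
  // validate(...) returns false here because of the null element at index 1.
  System.err.println("Skipping invalid record: " + record);
}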

Best Regards,
Øyvind Strømmen
