Hi,
Please see the code below, which reproduces the scenario:
import java.util.Arrays;
import java.util.Collections;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

// Record with a single optional field holding an array of strings.
Schema schema = new Schema.Parser().parse("""
    {
      "type": "record",
      "name": "person",
      "fields": [
        {
          "name": "address",
          "type": [
            "null",
            {
              "type": "array",
              "items": "string"
            }
          ],
          "default": null
        }
      ]
    }
    """);

ParquetWriter<GenericRecord> writer =
    AvroParquetWriter.<GenericRecord>builder(
            new org.apache.hadoop.fs.Path("/tmp/person.parquet"))
        .withSchema(schema)
        .build();

// First write: the array contains a null element and fails, as expected.
try {
  writer.write(new GenericRecordBuilder(schema)
      .set("address", Arrays.asList("first", null, "last"))
      .build());
} catch (Exception e) {
  e.printStackTrace();
}

// Second write: a valid record, yet it also fails.
try {
  writer.write(new GenericRecordBuilder(schema)
      .set("address", Collections.singletonList("first"))
      .build());
} catch (Exception e) {
  e.printStackTrace();
}
The first call to AvroParquetWriter#write attempts to write an array containing
a null element and fails, as expected, with "java.lang.NullPointerException:
Array contains a null element at 1". However, after that failure every
subsequent call to AvroParquetWriter#write, even with valid records, fails with
"org.apache.parquet.io.InvalidRecordException: 1(r) > 0 ( schema r)",
apparently because the state within the RecordConsumer isn't reset between
writes.
Is this the intended behavior of the writer? And if so, does one have to
create a new writer whenever a write fails?
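If recreating the writer is indeed required, the alternative I can think of is
validating records up front so that an invalid record never reaches the
RecordConsumer at all. The helper below is only a sketch (writeIfValid is my
own name, and it assumes GenericData.get().validate rejects cases like a null
element in an array of strings):

import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.parquet.hadoop.ParquetWriter;

// Sketch: only hand records to the writer if Avro itself considers them valid,
// so a failing write never leaves the RecordConsumer in a half-written state.
static void writeIfValid(ParquetWriter<GenericRecord> writer,
                         Schema schema,
                         GenericRecord record) throws IOException {
  if (GenericData.get().validate(schema, record)) {
    writer.write(record);
  } else {
    // Skip (or log) the invalid record instead of letting the write fail midway.
    System.err.println("Skipping invalid record: " + record);
  }
}

That would avoid the problem in this particular case, but it doesn't answer
whether a failed write is supposed to leave the writer unusable.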
Best Regards,
Øyvind Strømmen