Scott Carey created PARQUET-1407:
------------------------------------
Summary: Data loss on duplicate values with AvroParquetWriter/Reader
Key: PARQUET-1407
URL: https://issues.apache.org/jira/browse/PARQUET-1407
Project: Parquet
Issue Type: Bug
Components: parquet-avro
Affects Versions: 1.8.3
Reporter: Scott Carey
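{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetWriter;

public class Blah {
  private static Path parquetFile = new Path("oops");
  // Single optional bytes column named "value".
  private static Schema schema = SchemaBuilder.record("spark_schema")
      .fields().optionalBytes("value").endRecord();

  private static GenericData.Record recordFor(String value) {
    return new GenericRecordBuilder(schema)
        .set("value", value.getBytes()).build();
  }

  public static void main(String... args) throws IOException {
    // Write seven records; "one", "two" and "three" each appear twice.
    try (ParquetWriter<GenericData.Record> writer = AvroParquetWriter
        .<GenericData.Record>builder(parquetFile)
        .withSchema(schema)
        .build()) {
      writer.write(recordFor("one"));
      writer.write(recordFor("two"));
      writer.write(recordFor("three"));
      writer.write(recordFor("three"));
      writer.write(recordFor("two"));
      writer.write(recordFor("one"));
      writer.write(recordFor("zero"));
    }
    // Read the file back and print each value as a string.
    try (ParquetReader<GenericRecord> reader = AvroParquetReader
        .<GenericRecord>builder(parquetFile)
        .withConf(new Configuration()).build()) {
      GenericRecord rec;
      int i = 0;
      while ((rec = reader.read()) != null) {
        ByteBuffer buf = (ByteBuffer) rec.get("value");
        byte[] bytes = new byte[buf.remaining()];
        buf.get(bytes);
        System.out.println("rec " + i++ + ": " + new String(bytes));
      }
    }
  }
}
{code}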
Expected output:
{noformat}
rec 0: one
rec 1: two
rec 2: three
rec 3: three
rec 4: two
rec 5: one
rec 6: zero{noformat}
Actual output:
{noformat}
rec 0: one
rec 1: two
rec 2: three
rec 3:
rec 4:
rec 5:
rec 6: zero{noformat}
This was found when we unexpectedly started getting empty byte[] values back
in Spark (Spark 2.3.1 and Parquet 1.8.3). I have not tried to reproduce it
with Parquet 1.9.0, but it is a bad enough bug that I would like a 1.8.4
release that can replace 1.8.3 as a drop-in, without any binary compatibility
issues.
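
The report does not identify a cause. One hypothesis worth checking, and it is only an assumption here, is that the reader returns the same ByteBuffer instance for repeated values, so that the buf.get(bytes) call in the read loop drains it before the next record sees it. That would be consistent with the actual output above, where each value prints once and every repeat comes back empty. The sketch below uses only java.nio and no Parquet classes (BufferReadDemo and its two methods are hypothetical names) to show how a consuming read differs from reading through duplicate():

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BufferReadDemo {
    // Copies the remaining bytes and advances the buffer's position,
    // the same way the read loop in the reproduction does.
    static String consumingRead(ByteBuffer buf) {
        byte[] bytes = new byte[buf.remaining()];
        buf.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Reads through a duplicate, which shares the content but has its own
    // position, so the caller's buffer is left untouched.
    static String nonConsumingRead(ByteBuffer buf) {
        ByteBuffer view = buf.duplicate();
        byte[] bytes = new byte[view.remaining()];
        view.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumption: one buffer instance stands in for a value that is
        // handed out more than once.
        ByteBuffer shared = ByteBuffer.wrap("three".getBytes(StandardCharsets.UTF_8));
        System.out.println("[" + nonConsumingRead(shared) + "]"); // [three]
        System.out.println("[" + nonConsumingRead(shared) + "]"); // [three]
        System.out.println("[" + consumingRead(shared) + "]");    // [three]
        System.out.println("[" + consumingRead(shared) + "]");    // []
    }
}
```

If swapping the read loop to the duplicate() form makes the missing values reappear, that would point at buffer reuse on the read side rather than at data missing from the file; if they stay empty, this hypothesis can be ruled out.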