[
https://issues.apache.org/jira/browse/PARQUET-1407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692009#comment-16692009
]
ASF GitHub Bot commented on PARQUET-1407:
-----------------------------------------
nandorKollar opened a new pull request #552: PARQUET-1407: Data loss on
duplicate values with AvroParquetWriter/Reader
URL: https://github.com/apache/parquet-mr/pull/552
Based on #551: added a test case for the fix.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> Data loss on duplicate values with AvroParquetWriter/Reader
> -----------------------------------------------------------
>
> Key: PARQUET-1407
> URL: https://issues.apache.org/jira/browse/PARQUET-1407
> Project: Parquet
> Issue Type: Bug
> Components: parquet-avro
> Affects Versions: 1.9.0, 1.10.0, 1.8.3
> Reporter: Scott Carey
> Priority: Critical
> Labels: pull-request-available
> Fix For: 1.11.0
>
>
> {code:java}
> import java.io.IOException;
> import java.nio.ByteBuffer;
>
> import org.apache.avro.Schema;
> import org.apache.avro.SchemaBuilder;
> import org.apache.avro.generic.GenericData;
> import org.apache.avro.generic.GenericRecord;
> import org.apache.avro.generic.GenericRecordBuilder;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.parquet.avro.AvroParquetReader;
> import org.apache.parquet.avro.AvroParquetWriter;
> import org.apache.parquet.hadoop.ParquetReader;
> import org.apache.parquet.hadoop.ParquetWriter;
>
> public class Blah {
>
>   private static Path parquetFile = new Path("oops");
>
>   // Record schema with a single optional bytes field named "value".
>   private static Schema schema = SchemaBuilder.record("spark_schema")
>       .fields().optionalBytes("value").endRecord();
>
>   private static GenericData.Record recordFor(String value) {
>     return new GenericRecordBuilder(schema)
>         .set("value", value.getBytes()).build();
>   }
>
>   public static void main(String... args) throws IOException {
>     // Write seven records; "one", "two" and "three" appear twice each.
>     try (ParquetWriter<GenericData.Record> writer = AvroParquetWriter
>         .<GenericData.Record>builder(parquetFile)
>         .withSchema(schema)
>         .build()) {
>       writer.write(recordFor("one"));
>       writer.write(recordFor("two"));
>       writer.write(recordFor("three"));
>       writer.write(recordFor("three"));
>       writer.write(recordFor("two"));
>       writer.write(recordFor("one"));
>       writer.write(recordFor("zero"));
>     }
>
>     // Read everything back and print each record's "value" as a string.
>     try (ParquetReader<GenericRecord> reader = AvroParquetReader
>         .<GenericRecord>builder(parquetFile)
>         .withConf(new Configuration())
>         .build()) {
>       GenericRecord rec;
>       int i = 0;
>       while ((rec = reader.read()) != null) {
>         ByteBuffer buf = (ByteBuffer) rec.get("value");
>         byte[] bytes = new byte[buf.remaining()];
>         buf.get(bytes); // consuming the buffer advances its position
>         System.out.println("rec " + i++ + ": " + new String(bytes));
>       }
>     }
>   }
> }
> {code}
> Expected output:
> {noformat}
> rec 0: one
> rec 1: two
> rec 2: three
> rec 3: three
> rec 4: two
> rec 5: one
> rec 6: zero{noformat}
> Actual:
> {noformat}
> rec 0: one
> rec 1: two
> rec 2: three
> rec 3:
> rec 4:
> rec 5:
> rec 6: zero{noformat}
>
> This was found when we unexpectedly started getting empty byte[] values back
> in Spark (Spark 2.3.1 and Parquet 1.8.3). I have not tried to reproduce it
> with Parquet 1.9.0, but it is a bad enough bug that I would like a 1.8.4
> release that can be dropped in as a replacement for 1.8.3 without any binary
> compatibility issues.
> Duplicate byte[] values are lost.
>
> A few clues:
> If I do not call ByteBuffer.get(), ByteBuffer.remaining() does not go to
> zero. I suspect a ByteBuffer is being recycled across records, but the call
> to ByteBuffer.get() mutates it. I wonder whether an appropriately placed
> ByteBuffer.duplicate() would fix it.
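>
> To illustrate that last point: {{ByteBuffer.duplicate()}} creates a view with
> its own position and limit over the same content, so consuming the bytes
> through the duplicate leaves the original buffer's position untouched. A
> minimal sketch of the read loop with that guard in place (assuming the
> recycled-buffer hypothesis above; this only illustrates the idea on the
> reader side and is not the actual fix):
> {code:java}
> GenericRecord rec;
> int i = 0;
> while ((rec = reader.read()) != null) {
>   ByteBuffer shared = (ByteBuffer) rec.get("value");
>   // Duplicate before consuming: get() advances only the duplicate's
>   // position, never the position of the (possibly recycled) original.
>   ByteBuffer view = shared.duplicate();
>   byte[] bytes = new byte[view.remaining()];
>   view.get(bytes);
>   System.out.println("rec " + i++ + ": " + new String(bytes));
> }
> {code}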
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)