ccleva opened a new issue, #3336: URL: https://github.com/apache/parquet-java/issues/3336
### Describe the bug, including details regarding any error messages, version, and platform.

Tested using v1.16.0 on OpenJDK 11 and 17.

1. [nation.dict-malformed.parquet](https://github.com/apache/parquet-testing/blob/master/data/nation.dict-malformed.parquet)

```
> java -cp parquet-cli-1.16.0.jar:dependency/* org.apache.parquet.cli.Main cat nation.dict-malformed.parquet
Unknown error
java.lang.RuntimeException: Failed on record 0 in file nation.dict-malformed.parquet
    at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:89)
    at org.apache.parquet.cli.Main.run(Main.java:169)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
    at org.apache.parquet.cli.Main.main(Main.java:197)
Caused by: java.lang.RuntimeException: Failed while reading Parquet file: nation.dict-malformed.parquet
    at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:360)
    at org.apache.parquet.cli.BaseCommand$1$1.<init>(BaseCommand.java:337)
    at org.apache.parquet.cli.BaseCommand$1.iterator(BaseCommand.java:335)
    at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:76)
    ... 3 more
Caused by: java.io.EOFException
    at org.apache.parquet.bytes.SingleBufferInputStream.sliceBuffers(SingleBufferInputStream.java:134)
    at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAsBytesInput(ParquetFileReader.java:2100)
    at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1990)
    at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1920)
    at org.apache.parquet.hadoop.ParquetFileReader.readChunkPages(ParquetFileReader.java:1454)
    at org.apache.parquet.hadoop.ParquetFileReader.internalReadRowGroup(ParquetFileReader.java:1188)
    at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:1135)
    at org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:1380)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:140)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:245)
    at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
    at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:140)
    at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:356)
    ... 6 more
```

This seems related to an issue with an older version of the Java writer: apache/arrow#42298

It has been fixed in the C++/Python version by apache/parquet-cpp#209 (the file loads fine with PyArrow), but maybe not in the Hadoop reader?

Links to the fix and the test in the current version:
https://github.com/apache/arrow/blob/64f2055ffb68e5077420f4253e76d78952438cab/cpp/src/parquet/file_reader.cc#L199
https://github.com/apache/arrow/blob/64f2055ffb68e5077420f4253e76d78952438cab/cpp/src/parquet/reader_test.cc#L977

Note that the file can be read by the old parquet-tools (tested with v1.10.1).

2. [fixed_length_byte_array.parquet](https://github.com/apache/parquet-testing/blob/a3d96a65e11e2bbca7d22a894e8313ede90a33a3/data/fixed_length_byte_array.parquet)

PyArrow (and parquet-tools) also fail to read this one, so I think it's a wider problem. I'll open an issue on their repository; this is more to let you know.

```
> java -cp parquet-cli-1.16.0.jar:dependency/* org.apache.parquet.cli.Main cat fixed_length_byte_array.parquet
{"flba_field": [0, 0, 3, -24]}
[...]
{"flba_field": [0, 0, 3, -122]}
Unknown error
java.lang.RuntimeException: Failed on record 90 in file fixed_length_byte_array.parquet
    at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:89)
    at org.apache.parquet.cli.Main.run(Main.java:169)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
    at org.apache.parquet.cli.Main.main(Main.java:197)
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 92 in block 0 in file file:/home/ccleva/dev/tlabs-data/tablesaw-parquet/target/test/data/parquet-testing-master/data/fixed_length_byte_array.parquet
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:280)
    at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
    at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:356)
    at org.apache.parquet.cli.BaseCommand$1$1.next(BaseCommand.java:350)
    at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:76)
    ... 3 more
Caused by: org.apache.parquet.io.ParquetDecodingException: Can't read value in column [flba_field] required fixed_len_byte_array(4) flba_field at value 92 out of 1000, 92 out of 100 in currentPage. repetition level: 0, definition level: 0
    at org.apache.parquet.column.impl.ColumnReaderBase.readValue(ColumnReaderBase.java:604)
    at org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:30)
    at org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:477)
    at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30)
    at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:425)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:249)
    ... 7 more
Caused by: org.apache.parquet.io.ParquetDecodingException: could not read bytes at offset 364
    at org.apache.parquet.column.values.plain.FixedLenByteArrayPlainValuesReader.readBytes(FixedLenByteArrayPlainValuesReader.java:47)
    at org.apache.parquet.column.impl.ColumnReaderBase$2$6.read(ColumnReaderBase.java:411)
    at org.apache.parquet.column.impl.ColumnReaderBase.readValue(ColumnReaderBase.java:579)
    ... 12 more
Caused by: java.io.EOFException
    at org.apache.parquet.bytes.SingleBufferInputStream.slice(SingleBufferInputStream.java:116)
    at org.apache.parquet.column.values.plain.FixedLenByteArrayPlainValuesReader.readBytes(FixedLenByteArrayPlainValuesReader.java:45)
    ... 14 more
```

I recreated the file using the script in apache/parquet-testing#31 in case it was corrupted, but got the same result (at a different offset).

### Component(s)

_No response_
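
For reference on the first file: as far as I can tell from the linked `file_reader.cc`, the C++ workaround extends the byte range read for a column chunk when the file comes from an affected writer version, because that writer did not count the dictionary page header in the reported chunk size. Below is a minimal Java sketch of that arithmetic only. The class and method names (`ChunkRangePadding`, `paddedLength`), the `writerHasHeaderBug` flag, and the 100-byte cap are assumptions for illustration; none of this is existing parquet-java API, and how the affected writer versions are detected is left out.

```java
/** Sketch: pad a column-chunk read range for files whose writer under-reported the chunk size. */
final class ChunkRangePadding {
    // Assumed upper bound, in bytes, on the size of a dictionary page header.
    static final long MAX_DICT_HEADER_SIZE = 100;

    private ChunkRangePadding() {}

    /**
     * Returns how many bytes to read for a column chunk.
     *
     * @param chunkStart         offset of the first byte of the chunk in the file
     * @param reportedLength     chunk size as reported by the column metadata
     * @param fileLength         total size of the file in bytes
     * @param writerHasHeaderBug hypothetical flag: true if the writer is known to
     *                           omit the dictionary page header from the size
     */
    static long paddedLength(long chunkStart, long reportedLength,
                             long fileLength, boolean writerHasHeaderBug) {
        if (chunkStart < 0 || reportedLength < 0
                || chunkStart + reportedLength > fileLength) {
            throw new IllegalArgumentException("column chunk range exceeds file bounds");
        }
        if (!writerHasHeaderBug) {
            return reportedLength;
        }
        // Never pad past the end of the file.
        long bytesRemaining = fileLength - (chunkStart + reportedLength);
        return reportedLength + Math.min(MAX_DICT_HEADER_SIZE, bytesRemaining);
    }
}
```

With these assumptions, a chunk reported as 1000 bytes at offset 4 in a 5000-byte file would be read as 1100 bytes when the flag is set, and the padding shrinks when fewer than 100 bytes remain after the chunk.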
