dylanburati commented on issue #2986: URL: https://github.com/apache/parquet-java/issues/2986#issuecomment-2309079785
I have the same issue with a corrupted file due to overflow in this field; it was created using the Rust parquet crate, which uses unsigned ints for this field ([link](https://github.com/apache/arrow-rs/blob/855666d9e9283c1ef11648762fe92c7c188b68f1/parquet/src/file/footer.rs#L133)). Also, the file is usable with `pyarrow`. I'm wondering if this specific field could be treated as unsigned in Java as well, since it doesn't seem to be referenced as `i32` in the format [specification](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift).

```
$ tail -c 64 ~/Downloads/enwiki/20240620/enwiki_20240620.parquet | xxd -g 4
00000000: 41414141 41414141 41454141 41414141  AAAAAAAAAEAAAAAA
00000010: 67414141 476c6b41 41413d00 18197061  gAAAGlkAAA=...pa
00000020: 72717565 742d7273 20766572 73696f6e  rquet-rs version
00000030: 2033342e 302e3000 e755eb8a 50415231   34.0.0..U..PAR1

$ parquet pages ~/Downloads/enwiki/20240620/enwiki_20240620.parquet
Unknown error
java.lang.RuntimeException: corrupted file: the footer index is not within the file: 39975304334
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:608)
    at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:902)
    at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:659)
    at org.apache.parquet.cli.commands.ShowPagesCommand.run(ShowPagesCommand.java:93)
    at org.apache.parquet.cli.Main.run(Main.java:163)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
    at org.apache.parquet.cli.Main.main(Main.java:191)

$ python -c "print($(stat -c %s ~/Downloads/enwiki/20240620/enwiki_20240620.parquet) - 8 - (-0x10000_0000 + 0x8aeb_55e7))"
39975304334

$ python -c 'import pyarrow.parquet as pq; f = pq.ParquetFile("~/Downloads/enwiki/20240620/enwiki_20240620.parquet"); print(f.metadata)'
<pyarrow._parquet.FileMetaData object at 0x729a06892a70>
  created_by: parquet-rs version 34.0.0
  num_columns: 6
  num_rows: 23802888
  num_row_groups: 238062
  format_version: 1.0
  serialized_size: 2330678759
```
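To make the arithmetic above easier to follow, here is a minimal sketch that decodes the last 8 bytes from the `xxd` dump both ways. It only uses values that appear in the output above; the file size is not printed anywhere in this comment, so it is back-computed from the index in the Java error message and should be read as a derived value, not a measured one.

```python
import struct

# Last 8 bytes of the file, copied from the xxd dump:
# a 4-byte little-endian footer length followed by the magic "PAR1".
tail = bytes.fromhex("e755eb8a50415231")

signed_len, magic = struct.unpack("<i4s", tail)  # signed 32-bit read
unsigned_len, _ = struct.unpack("<I4s", tail)    # unsigned 32-bit read
assert magic == b"PAR1"

print(signed_len)    # -1964288537
print(unsigned_len)  # 2330678759, equal to serialized_size reported by pyarrow

# Assumption: back out the file size from the index in the Java error,
# which was computed as file_size - 8 - signed_len.
reported_index = 39975304334
file_size = reported_index + 8 + signed_len

# Footer start under each interpretation of the length field.
index_signed = file_size - 8 - signed_len      # past the end of the file
index_unsigned = file_size - 8 - unsigned_len  # inside the file

print(index_signed > file_size)          # True
print(0 <= index_unsigned < file_size)   # True
```

With the signed read the subtraction turns into an addition, so the computed footer start lands beyond the end of the file; with the unsigned read it falls inside the file and the length agrees with what `pyarrow` reports.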
