dylanburati commented on issue #2986:
URL: https://github.com/apache/parquet-java/issues/2986#issuecomment-2309079785

   I have the same issue: parquet-java rejects a file as corrupted because of
an overflow in this field (the 4-byte footer length that precedes the trailing
`PAR1` magic). The file was created with the Rust parquet crate, which treats
this field as an unsigned int
([link](https://github.com/apache/arrow-rs/blob/855666d9e9283c1ef11648762fe92c7c188b68f1/parquet/src/file/footer.rs#L133)),
and it is readable with `pyarrow`. Could this specific field be treated as
unsigned in Java as well? It doesn't seem to be declared as `i32` in the format
[specification](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift).
   
   ```
   $ tail -c 64 ~/Downloads/enwiki/20240620/enwiki_20240620.parquet | xxd -g 4
   00000000: 41414141 41414141 41454141 41414141  AAAAAAAAAEAAAAAA
   00000010: 67414141 476c6b41 41413d00 18197061  gAAAGlkAAA=...pa
   00000020: 72717565 742d7273 20766572 73696f6e  rquet-rs version
   00000030: 2033342e 302e3000 e755eb8a 50415231   34.0.0..U..PAR1
   
   $ parquet pages ~/Downloads/enwiki/20240620/enwiki_20240620.parquet
   Unknown error
   java.lang.RuntimeException: corrupted file: the footer index is not within the file: 39975304334
           at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:608)
           at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:902)
           at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:659)
           at org.apache.parquet.cli.commands.ShowPagesCommand.run(ShowPagesCommand.java:93)
           at org.apache.parquet.cli.Main.run(Main.java:163)
           at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
           at org.apache.parquet.cli.Main.main(Main.java:191)
   
   $ python -c "print($(stat -c %s ~/Downloads/enwiki/20240620/enwiki_20240620.parquet) - 8 - (-0x10000_0000 + 0x8aeb_55e7))"
   39975304334
   
   $ python -c 'import pyarrow.parquet as pq; f = pq.ParquetFile("~/Downloads/enwiki/20240620/enwiki_20240620.parquet"); print(f.metadata)'
   <pyarrow._parquet.FileMetaData object at 0x729a06892a70>
     created_by: parquet-rs version 34.0.0
     num_columns: 6
     num_rows: 23802888
     num_row_groups: 238062
     format_version: 1.0
     serialized_size: 2330678759
   ```
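
For anyone who wants to reproduce the arithmetic without the file, here is a minimal sketch of the two interpretations of the trailer bytes from the hexdump above (`e755eb8a` followed by `PAR1`). The helper names `footer_lengths` and `footer_start` are made up for illustration and are not part of any Parquet library. The file size is not printed anywhere in the session; the value below is back-computed from the index in the error message, so treat it as an assumption.

```python
import struct

MAGIC = b"PAR1"  # trailing magic bytes of a Parquet file


def footer_lengths(trailer: bytes) -> tuple:
    """Decode the 4 length bytes before the magic as (signed, unsigned) little-endian ints."""
    if len(trailer) != 8 or trailer[4:] != MAGIC:
        raise ValueError("expected 4 length bytes followed by PAR1")
    (signed,) = struct.unpack("<i", trailer[:4])    # what a signed 32-bit reader sees
    (unsigned,) = struct.unpack("<I", trailer[:4])  # what an unsigned 32-bit reader sees
    return signed, unsigned


def footer_start(file_size: int, footer_len: int) -> int:
    """Offset where the footer metadata begins: size - 4 (length) - 4 (magic) - length."""
    return file_size - 4 - len(MAGIC) - footer_len


# Last 8 bytes of the file, taken from the xxd output above.
trailer = bytes.fromhex("e755eb8a") + MAGIC
signed, unsigned = footer_lengths(trailer)

# Assumption: derived from the reported index, 39975304334 + 8 - 1964288537.
file_size = 38011015805

print(signed, unsigned)                   # -1964288537 2330678759
print(footer_start(file_size, signed))    # 39975304334, past the end of the file
print(footer_start(file_size, unsigned))  # 35680337038, inside the file
```

The unsigned value matches the `serialized_size` that `pyarrow` reports, and the signed value reproduces the out-of-range index from the Java exception.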


