jhorstmann commented on issue #5338:
URL: https://github.com/apache/arrow-rs/issues/5338#issuecomment-1914458985

   Trying to read the file with polars (which forked the parquet2 code) results 
in the following error:
   
   > parquet: File out of specification: The number of bytes declared in v1 def 
levels is higher than the page size
   
   It seems it is trying to read a 4-byte length prefix, which should only be 
written for RLE encoding.
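
To illustrate the difference, here is a minimal sketch of how a reader might locate v1 level data under each encoding. It assumes the usual Parquet layout, where RLE levels carry a 4-byte little-endian length prefix and bit-packed levels have a size derived from the value count and bit width. `LevelEncoding`, `bit_width` and `level_byte_range` are hypothetical names for this example, not the arrow-rs, parquet2 or Arrow C++ API.

```rust
/// Hypothetical enum for this sketch; not a type from any Parquet crate.
#[derive(Clone, Copy, Debug, PartialEq)]
enum LevelEncoding {
    Rle,
    BitPacked,
}

/// Number of bits needed to store a level in the range 0..=max_level.
fn bit_width(max_level: u16) -> usize {
    (16 - max_level.leading_zeros()) as usize
}

/// Returns `(offset, len)` of the level bytes inside `page`, or `None`
/// when the declared or computed size does not fit in the page.
fn level_byte_range(
    encoding: LevelEncoding,
    page: &[u8],
    num_values: usize,
    max_level: u16,
) -> Option<(usize, usize)> {
    match encoding {
        LevelEncoding::Rle => {
            // Assumed layout: 4-byte little-endian length, then the RLE data.
            let prefix: [u8; 4] = page.get(..4)?.try_into().ok()?;
            let len = u32::from_le_bytes(prefix) as usize;
            let end = 4usize.checked_add(len)?;
            if end <= page.len() { Some((4, len)) } else { None }
        }
        LevelEncoding::BitPacked => {
            // Assumed layout: no prefix; size follows from count and bit width.
            let bits = num_values.checked_mul(bit_width(max_level))?;
            let len = (bits + 7) / 8;
            if len <= page.len() { Some((0, len)) } else { None }
        }
    }
}
```

With the 4-byte page `[0xFF, 0x03, 0xAA, 0xBB]`, 10 values and a max level of 1, the bit-packed branch returns `Some((0, 2))`, while the RLE branch interprets the level bytes as a length far larger than the page and returns `None`, which matches the kind of error reported above.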
   
   I also asked a colleague with a working Python setup to try reading the file, 
and he reported the following outcomes:
   
    - `fastparquet`: endless loop 
    - `pyarrow`: error `Received invalid number of bytes (corrupt data page?)`
    
    The latter seems similar to the polars error. [Looking at the 
code](https://github.com/apache/arrow/blob/21ffd82c05c93b873ae3c27128eb8604ed0c735f/cpp/src/parquet/column_reader.cc#L144),
 it's a bounds check that expects 4 additional bytes even when the number of 
bytes is calculated from the bit width.
   
   Considering all these issues, it would probably be better to remove support 
for bitpacked levels completely.

