jhorstmann commented on issue #5338:
URL: https://github.com/apache/arrow-rs/issues/5338#issuecomment-1914458985
Trying to read the file with polars (which forked the parquet2 code) results
in the following error:
> parquet: File out of specification: The number of bytes declared in v1 def
> levels is higher than the page size
It seems to be trying to read a 4-byte length prefix, which should only be
written for RLE encoding.
I also asked a colleague with a working Python setup to try reading the file,
and he reported the following outcomes:
- `fastparquet`: endless loop
- `pyarrow`: error `Received invalid number of bytes (corrupt data page?)`
The latter seems similar to the polars error. [Looking at the
code](https://github.com/apache/arrow/blob/21ffd82c05c93b873ae3c27128eb8604ed0c735f/cpp/src/parquet/column_reader.cc#L144),
it is a bounds check that expects 4 additional bytes even when the number of
bytes is calculated from the bit width.
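To make the difference concrete, here is a minimal sketch of how a reader could locate the level bytes in a v1 data page for each encoding. It assumes the layout described by the Parquet format: RLE levels carry a 4-byte little-endian length prefix, while the deprecated bit-packed levels have no prefix and their size follows from the value count and bit width. `LevelEncoding`, `bit_width` and `level_bytes` are hypothetical names for illustration and are not the arrow-rs, parquet2 or Arrow C++ API.

```rust
/// Hypothetical enum for this sketch; not the arrow-rs `Encoding` type.
#[derive(Clone, Copy)]
enum LevelEncoding {
    Rle,
    BitPacked,
}

/// Number of bits needed to store a level in `0..=max_level`.
fn bit_width(max_level: u16) -> u32 {
    u16::BITS - max_level.leading_zeros()
}

/// Returns `(offset, length)` of the level data inside `page`,
/// or `None` if the declared or computed size does not fit in the page.
fn level_bytes(
    page: &[u8],
    encoding: LevelEncoding,
    num_values: usize,
    max_level: u16,
) -> Option<(usize, usize)> {
    match encoding {
        LevelEncoding::Rle => {
            // RLE levels in a v1 page start with a 4-byte little-endian length.
            let prefix: [u8; 4] = page.get(..4)?.try_into().ok()?;
            let len = u32::from_le_bytes(prefix) as usize;
            let end = 4usize.checked_add(len)?;
            (end <= page.len()).then_some((4, len))
        }
        LevelEncoding::BitPacked => {
            // No prefix: the size is ceil(num_values * bit_width / 8).
            let bits = num_values.checked_mul(bit_width(max_level) as usize)?;
            let len = bits.checked_add(7)? / 8;
            (len <= page.len()).then_some((0, len))
        }
    }
}
```

If a reader applies the RLE branch to bit-packed data, the first four level bytes are misread as a length, which would plausibly produce the kind of "higher than the page size" error shown above.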
Considering all these issues, it would probably be better to remove support
for bitpacked levels completely.