Hi,

pyarrow fails to read a custom Parquet file that has extra information (some custom stats) embedded in it. The read fails with the following traceback:
Traceback (most recent call last):
  File "/tmp/pytest.py", line 19, in <module>
    table = dataset.read()
  File "/usr/local/lib/python3.7/site-packages/pyarrow/parquet.py", line 214, in read
    use_threads=use_threads)
  File "pyarrow/_parquet.pyx", line 737, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Couldn't deserialize thrift: TProtocolException: Invalid data
Deserializing page header failed.

Looking at the code, I realised that SerializedPageReader throws an exception if the page header size goes beyond 16k (the default max). There is a setter method for the max page header size, but it is used only in tests.

Is there a way to get around the problem?

Regards,
Shyam
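
P.S. To make the suspected cause concrete, here is a rough Python sketch of a reader that caps the page header size and exposes a setter for the cap. It is a toy model only, not pyarrow or Arrow C++ code: the class, method, and exception names are hypothetical, and the 16k default is simply the figure quoted in this message.

```python
# Illustrative sketch only: this is NOT pyarrow or Arrow C++ code.
# Every name below is hypothetical and exists only for this example.

# The "16k" figure quoted in this message, taken on trust (assumption).
DEFAULT_MAX_PAGE_HEADER_SIZE = 16 * 1024


class PageHeaderTooLargeError(Exception):
    """Raised when a page header exceeds the configured limit."""


class ToyPageReader:
    """Toy model of a page reader that caps the page header size."""

    def __init__(self, max_page_header_size=DEFAULT_MAX_PAGE_HEADER_SIZE):
        self.max_page_header_size = max_page_header_size

    def set_max_page_header_size(self, size):
        # Stand-in for the setter mentioned in this message.
        self.max_page_header_size = size

    def read_header(self, header_bytes):
        # Reject headers above the limit instead of trying to decode them.
        if len(header_bytes) > self.max_page_header_size:
            raise PageHeaderTooLargeError(
                "page header of %d bytes exceeds limit of %d bytes"
                % (len(header_bytes), self.max_page_header_size)
            )
        return len(header_bytes)


if __name__ == "__main__":
    reader = ToyPageReader()
    # A header bloated by embedded custom stats (size chosen arbitrarily).
    big_header = b"\x00" * (64 * 1024)

    try:
        reader.read_header(big_header)
    except PageHeaderTooLargeError as exc:
        print("failed:", exc)

    # Raising the cap lets the same header through.
    reader.set_max_page_header_size(128 * 1024)
    print("ok:", reader.read_header(big_header))
```

If the real reader behaves like this, the workaround I am hoping for is some way to raise the cap from Python rather than only from the C++ tests.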