Hi

pyarrow fails to read a custom Parquet file that has extra information
(some custom stats) embedded in it:


Traceback (most recent call last):
  File "/tmp/pytest.py", line 19, in <module>
    table = dataset.read()
  File "/usr/local/lib/python3.7/site-packages/pyarrow/parquet.py", line 214, in read
    use_threads=use_threads)
  File "pyarrow/_parquet.pyx", line 737, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Couldn't deserialize thrift: TProtocolException: Invalid data
Deserializing page header failed.



Looking at the code, I realised that SerializedPageReader throws an exception
if the page header size goes beyond 16k (the default max). There is a setter
method for the max page header size, but it is used only in tests.
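
To make the suspected failure mode concrete, here is a toy model of that size check in plain Python. It is only a sketch: it does not use pyarrow or Thrift, the class ToyPageReader, the helper make_page, and the length-prefixed header layout are all hypothetical, and the 16k limit is taken from my reading of the code above rather than verified.

```python
import struct

# Assumed limit, based on the description above (not verified against the source).
DEFAULT_MAX_PAGE_HEADER_SIZE = 16 * 1024


class ToyPageReader:
    """Toy stand-in for a page reader; NOT the real SerializedPageReader."""

    def __init__(self, data: bytes):
        self._data = data
        self._max_page_header_size = DEFAULT_MAX_PAGE_HEADER_SIZE

    def set_max_page_header_size(self, size: int) -> None:
        # Models the setter that, as far as I can tell, only tests call.
        self._max_page_header_size = size

    def read_header(self) -> bytes:
        # Hypothetical layout: 4-byte little-endian length, then the header bytes.
        (length,) = struct.unpack_from("<I", self._data, 0)
        if length > self._max_page_header_size:
            raise IOError("Deserializing page header failed.")
        return self._data[4:4 + length]


def make_page(header: bytes) -> bytes:
    """Build a toy page: length prefix followed by the header payload."""
    return struct.pack("<I", len(header)) + header


# A header bloated by embedded custom stats (32 KiB here) exceeds the limit.
reader = ToyPageReader(make_page(b"s" * (32 * 1024)))
try:
    reader.read_header()
except IOError as exc:
    print("read failed:", exc)

# Raising the limit lets the same page be read.
reader.set_max_page_header_size(64 * 1024)
print("header bytes read:", len(reader.read_header()))
```

If the real reader behaves like this, exposing the max page header size to Python would be enough for my file.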


Is there a way to get around the problem?


Regards

Shyam
