Joris Van den Bossche created ARROW-13655:
---------------------------------------------

             Summary: [C++][Parquet] Reading large Parquet file can give 
"MaxMessageSize reached" error with Thrift 0.14
                 Key: ARROW-13655
                 URL: https://issues.apache.org/jira/browse/ARROW-13655
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Parquet
            Reporter: Joris Van den Bossche


From https://github.com/dask/dask/issues/8027

Apache Thrift introduced a `MaxMessageSize` configuration option 
(https://github.com/apache/thrift/blob/master/doc/specs/thrift-tconfiguration.md#maxmessagesize)
 in version 0.14 (THRIFT-5237). 
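
To make the mechanism concrete, here is a minimal sketch of how a per-message byte budget behaves. It is not Thrift's implementation: `BoundedReader` and `MessageSizeExceeded` are hypothetical names invented for illustration, and the only thing borrowed from the report is the error text.

```python
class MessageSizeExceeded(Exception):
    """Raised when a reader is asked for more bytes than its budget allows."""


class BoundedReader:
    """Hypothetical reader that enforces a byte budget for one message."""

    def __init__(self, data: bytes, max_message_size: int):
        self._data = data
        self._pos = 0
        self._remaining = max_message_size

    def read(self, n: int) -> bytes:
        # Charge the requested bytes against the budget before reading.
        if n > self._remaining:
            raise MessageSizeExceeded("MaxMessageSize reached")
        self._remaining -= n
        chunk = self._data[self._pos:self._pos + n]
        self._pos += n
        return chunk


payload = bytes(1024)

# A budget larger than the payload: the read succeeds.
roomy = BoundedReader(payload, max_message_size=2048)
roomy_result = roomy.read(1024)

# A budget smaller than the payload: the read is rejected.
tight = BoundedReader(payload, max_message_size=512)
try:
    tight.read(1024)
    tight_error = None
except MessageSizeExceeded as exc:
    tight_error = str(exc)
```

The point of the sketch is that a payload written without any such limit can be perfectly valid and still be rejected by a reader that enforces one.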

I think this is the cause of an issue reported originally at 
https://github.com/dask/dask/issues/8027, where one can get an _"OSError: 
Couldn't deserialize thrift: MaxMessageSize reached"_ error while reading a 
large Parquet (metadata-only) file. 

In the original report, the file was written using the python fastparquet 
library (which uses the python thrift bindings, which still use Thrift 0.13), 
but I was able to construct a reproducible code example with pyarrow.

Create a large metadata Parquet file with pyarrow in an environment with Arrow 
built against Thrift 0.13 (e.g. with a local install from source, or by 
installing pyarrow 2.0 from conda-forge, which can be installed with 
libthrift 0.13):

{code:python}
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({str(i): np.random.randn(10) for i in range(1_000)})
pq.write_table(table, "__temp_file_for_metadata.parquet")
metadata = pq.read_metadata("__temp_file_for_metadata.parquet")
metadata2 = pq.read_metadata("__temp_file_for_metadata.parquet")

# Append the same row group 4000 times to inflate the metadata.
for _ in range(4000):
    metadata.append_row_groups(metadata2)
metadata.write_metadata_file("test_parquet_metadata_large_file.parquet")
{code}

Reading this file back in the same environment works fine, but reading it in 
an environment with Thrift 0.14 (e.g. the latest pyarrow installed from 
conda-forge) gives the following error:

{code:python}
In [1]: import pyarrow.parquet as pq

In [2]: pq.read_metadata("test_parquet_metadata_large_file.parquet")
...
OSError: Couldn't deserialize thrift: MaxMessageSize reached
{code}
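
For a rough sense of scale, the arithmetic below counts the column chunks described by the metadata file from the reproducer and works out how many bytes each could occupy before a size limit is hit. Two things are assumptions rather than measurements: that `write_table` produces a single row group for this 10-row table, and that the limit in effect is Thrift's documented default of 100 MB.

```python
# Figures taken from the reproducer above.
num_columns = 1_000
initial_row_groups = 1        # assumption: 10 rows end up in one row group
appended_row_groups = 4_000   # one row group per append_row_groups call

# Assumption: Thrift's documented default MaxMessageSize of 100 MB.
assumed_max_message_size = 100 * 1024 * 1024

total_row_groups = initial_row_groups + appended_row_groups
total_column_chunks = total_row_groups * num_columns

# Average number of bytes each column chunk's metadata may use
# before the whole footer exceeds the assumed limit.
budget_per_chunk = assumed_max_message_size / total_column_chunks

print(total_column_chunks)          # 4001000
print(round(budget_per_chunk, 1))   # 26.2
```

Under these assumptions, the footer stays below the limit only if each column chunk's metadata averages about 26 bytes or less, which leaves very little room for paths, encodings, offsets and statistics.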



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
