Sarah Gilmore created ARROW-14723:
-------------------------------------

             Summary: [Python] pyarrow cannot import parquet files containing row groups whose lengths exceed int32 max
                 Key: ARROW-14723
                 URL: https://issues.apache.org/jira/browse/ARROW-14723
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 5.0.0
            Reporter: Sarah Gilmore
         Attachments: intmax32.parq, intmax32plus1.parq

It's possible to create Parquet files containing row groups whose lengths exceed int32 max (2147483647). However, pyarrow cannot read such files.
{code:python}
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq

# intmax32.parq can be read without any issues
>>> t = pq.read_table("intmax32.parq")

# intmax32plus1.parq cannot be read
>>> t = pq.read_table("intmax32plus1.parq")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pyarrow/parquet.py", line 1895, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pyarrow/parquet.py", line 1744, in read
    table = self._dataset.to_table(
  File "pyarrow/_dataset.pyx", line 465, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3075, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
OSError: Negative size (corrupt file?)
{code}
 

However, both files can be read without any issues using the Arrow C++ library.
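
For reference, a minimal sketch of how a file like the attachments might be produced and how row-group boundaries can be inspected (the file name and tiny table here are illustrative; an actual reproducer needs a single row group with more than 2147483647 rows, which this sketch does not attempt to allocate):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative only: row_group_size controls how many rows go into each
# row group, so a very large value forces all rows into a single group.
# A real reproducer would use a table with > 2147483647 rows; a tiny
# table stands in here.
table = pa.table({"x": pa.array(range(10), type=pa.int8())})
pq.write_table(table, "small.parq", row_group_size=2**31)

# Row-group boundaries are visible in the Parquet file metadata:
md = pq.ParquetFile("small.parq").metadata
print(md.num_row_groups)         # number of row groups in the file
print(md.row_group(0).num_rows)  # rows in the first row group
{code}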

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
