[
https://issues.apache.org/jira/browse/ARROW-14723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17445125#comment-17445125
]
Joris Van den Bossche commented on ARROW-14723:
-----------------------------------------------
One problem with this file is that it indicates a negative number of rows
(that's the direct cause of the error above):
{code:python}
In [3]: pq.read_metadata("../Downloads/intmax32plus1.parq")
Out[3]:
<pyarrow._parquet.FileMetaData object at 0x7f3ee0eb6310>
created_by: parquet-cpp-arrow version 4.0.1
num_columns: 1
num_rows: -2147483648
num_row_groups: 1
format_version: 2.6
serialized_size: 330
In [4]: pq.read_metadata("../Downloads/intmax32plus1.parq").row_group(0)
Out[4]:
<pyarrow._parquet.RowGroupMetaData object at 0x7f3e8e0b8400>
num_columns: 1
num_rows: -2147483648
total_byte_size: 40470
{code}
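For what it's worth, -2147483648 is exactly what int32 max + 1 (2147483648) becomes when truncated to a signed 32-bit integer, which suggests a 32-bit overflow somewhere on the metadata path (the Parquet Thrift {{num_rows}} fields are i64, as far as I know, so the truncation would have to happen before serialization). A minimal sketch of the wraparound, purely for illustration:

{code:python}
import struct

# 2147483648 = int32 max + 1, the intended row count of intmax32plus1.parq
num_rows = 2**31

# Pack as unsigned 32-bit, unpack as signed 32-bit:
# a two's-complement wraparound, as a C int32_t would see it.
wrapped, = struct.unpack("<i", struct.pack("<I", num_rows))
print(wrapped)  # -2147483648, matching the num_rows shown above
{code}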
I didn't yet check further whether this is an issue on the reader side, or actually
an issue with the file (and thus potentially an issue on the writer side).
[~sgilmore] could you share the code you used to create those files?
> [Python] pyarrow cannot import parquet files containing row groups whose
> lengths exceed int32 max.
> ---------------------------------------------------------------------------------------------------
>
> Key: ARROW-14723
> URL: https://issues.apache.org/jira/browse/ARROW-14723
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 5.0.0
> Reporter: Sarah Gilmore
> Priority: Minor
> Attachments: intmax32.parq, intmax32plus1.parq
>
>
> It's possible to create Parquet files containing row groups whose lengths are
> greater than int32 max (2147483647). However, pyarrow cannot read these
> files.
> {code:python}
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> # intmax32.parq can be read in without any issues
> >>> t = pq.read_table("intmax32.parq");
> # intmax32plus1.parq cannot be read in
> >>> t = pq.read_table("intmax32plus1.parq");
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File
> "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pyarrow/parquet.py",
> line 1895, in read_table
> return dataset.read(columns=columns, use_threads=use_threads,
> File
> "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pyarrow/parquet.py",
> line 1744, in read
> table = self._dataset.to_table(
> File "pyarrow/_dataset.pyx", line 465, in pyarrow._dataset.Dataset.to_table
> File "pyarrow/_dataset.pyx", line 3075, in pyarrow._dataset.Scanner.to_table
> File "pyarrow/error.pxi", line 143, in
> pyarrow.lib.pyarrow_internal_check_status
> File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
> OSError: Negative size (corrupt file?)
> {code}
>
> However, both files can be imported via the C++ Arrow bindings without any
> issues.
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)