[
https://issues.apache.org/jira/browse/ARROW-14723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17445125#comment-17445125
]
Joris Van den Bossche edited comment on ARROW-14723 at 11/17/21, 12:11 PM:
---------------------------------------------------------------------------
One problem with this file is that it indicates a negative number of rows
(that's the direct cause of the error above):
{code:python}
In [3]: pq.read_metadata("../Downloads/intmax32plus1.parq")
Out[3]:
<pyarrow._parquet.FileMetaData object at 0x7f3ee0eb6310>
created_by: parquet-cpp-arrow version 4.0.1
num_columns: 1
num_rows: -2147483648
num_row_groups: 1
format_version: 2.6
serialized_size: 330
In [4]: pq.read_metadata("../Downloads/intmax32plus1.parq").row_group(0)
Out[4]:
<pyarrow._parquet.RowGroupMetaData object at 0x7f3e8e0b8400>
num_columns: 1
num_rows: -2147483648
total_byte_size: 40470
{code}
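For reference, -2147483648 is exactly what 2147483648 (int32 max + 1, which the file name hints at) becomes when only its low 32 bits are kept and read as a signed integer. Whether such a truncation actually happens anywhere in the reader or the writer is an assumption, not something established here. The sketch below only illustrates the arithmetic; {{wrap_to_int32}} is a hypothetical helper, not a pyarrow function.

```python
INT32_MAX = 2**31 - 1  # 2147483647


def wrap_to_int32(value: int) -> int:
    """Interpret the low 32 bits of `value` as a signed two's-complement integer.

    Hypothetical helper for illustration only; not part of pyarrow.
    """
    low_bits = value & 0xFFFFFFFF
    # Bit patterns with the top bit set represent negative numbers.
    return low_bits - 2**32 if low_bits >= 2**31 else low_bits


# Assumption: the file names describe the row counts (int32 max, and int32 max + 1).
print(wrap_to_int32(INT32_MAX))      # 2147483647, still representable
print(wrap_to_int32(INT32_MAX + 1))  # -2147483648, matches num_rows above
```

If this is indeed the mechanism, the first attached file would sit just below the limit and the second just above it, which is consistent with only the second one failing.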
I haven't yet checked whether this is an issue on the reader side (incorrectly
reading the metadata) or an actual issue with the file (and thus potentially an
issue on the writer side).
[~sgilmore] could you share the code you used to create those files?
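To check other files for the same symptom, something along these lines might help. It is a minimal sketch that relies only on the attributes visible in the output above ({{num_rows}}, {{num_row_groups}}, {{row_group(i)}}). The function name {{negative_row_counts}} and the stand-in classes are hypothetical, and the stand-ins exist only so the snippet runs without the attached file.

```python
def negative_row_counts(metadata):
    """Return (location, num_rows) pairs for every negative row count.

    `metadata` is expected to expose `num_rows`, `num_row_groups` and
    `row_group(i)`, like the FileMetaData object printed above.
    """
    problems = []
    if metadata.num_rows < 0:
        problems.append(("file", metadata.num_rows))
    for i in range(metadata.num_row_groups):
        rows = metadata.row_group(i).num_rows
        if rows < 0:
            problems.append((f"row_group({i})", rows))
    return problems


# Hypothetical stand-ins that mimic the metadata shown above.
class FakeRowGroup:
    def __init__(self, num_rows):
        self.num_rows = num_rows


class FakeMetadata:
    def __init__(self, row_group_sizes):
        self._groups = [FakeRowGroup(n) for n in row_group_sizes]
        self.num_rows = sum(row_group_sizes)
        self.num_row_groups = len(self._groups)

    def row_group(self, i):
        return self._groups[i]


print(negative_row_counts(FakeMetadata([-2147483648])))
```

With pyarrow installed, the same helper could be called as {{negative_row_counts(pq.read_metadata(path))}}.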
> [Python] pyarrow cannot import parquet files containing row groups whose
> lengths exceed int32 max.
> ---------------------------------------------------------------------------------------------------
>
> Key: ARROW-14723
> URL: https://issues.apache.org/jira/browse/ARROW-14723
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 5.0.0
> Reporter: Sarah Gilmore
> Priority: Minor
> Attachments: intmax32.parq, intmax32plus1.parq
>
>
> It's possible to create Parquet files containing row groups whose lengths are
> greater than int32 max (2147483647). However, PyArrow cannot read these
> files.
> {code:python}
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> # intmax32.parq can be read in without any issues
> >>> t = pq.read_table("intmax32.parq");
> # intmax32plus1.parq cannot be read in
> >>> t = pq.read_table("intmax32plus1.parq");
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pyarrow/parquet.py", line 1895, in read_table
>     return dataset.read(columns=columns, use_threads=use_threads,
>   File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pyarrow/parquet.py", line 1744, in read
>     table = self._dataset.to_table(
>   File "pyarrow/_dataset.pyx", line 465, in pyarrow._dataset.Dataset.to_table
>   File "pyarrow/_dataset.pyx", line 3075, in pyarrow._dataset.Scanner.to_table
>   File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
> OSError: Negative size (corrupt file?)
> {code}
>
> However, both files can be imported via the C++ Arrow bindings without any
> issues.
>