[https://issues.apache.org/jira/browse/ARROW-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17661022#comment-17661022]
Rok Mihevc commented on ARROW-3999:
-----------------------------------
This issue has been migrated to [issue #20601|https://github.com/apache/arrow/issues/20601]
on GitHub. Please see the [migration documentation|https://github.com/apache/arrow/issues/14542]
for further details.
> [Python] Can't read large file that pyarrow wrote
> -------------------------------------------------
>
> Key: ARROW-3999
> URL: https://issues.apache.org/jira/browse/ARROW-3999
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.11.1
> Environment: OS: OSX High Sierra 10.13.6
> Python: 3.7.0
> PyArrow: 0.11.1
> Pandas: 0.23.4
> Reporter: Diego Argueta
> Priority: Major
>
> I loaded a large Pandas DataFrame from a CSV and successfully wrote it to a
> Parquet file using the DataFrame's {{to_parquet}} method. However, reading
> that same file back results in an exception. The DataFrame consists of about
> 32 million rows with seven columns; four are ASCII text and three are
> booleans.
>
> {code:python}
> >>> source_df.shape
> (32070402, 7)
> >>> source_df.dtypes
> Url Source            object
> Url Destination       object
> Anchor text           object
> Follow / No-Follow    object
> Link No-Follow          bool
> Meta No-Follow          bool
> Robot No-Follow         bool
> dtype: object
> >>> source_df.to_parquet('export.parq', compression='gzip', use_deprecated_int96_timestamps=True)
> >>> loaded_df = pd.read_parquet('export.parq')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pandas/io/parquet.py", line 288, in read_parquet
>     return impl.read(path, columns=columns, **kwargs)
>   File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pandas/io/parquet.py", line 131, in read
>     **kwargs).to_pandas()
>   File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 1074, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/filesystem.py", line 184, in read_parquet
>     use_pandas_metadata=use_pandas_metadata)
>   File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 943, in read
>     use_pandas_metadata=use_pandas_metadata)
>   File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 500, in read
>     table = reader.read(**options)
>   File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 187, in read
>     use_threads=use_threads)
>   File "pyarrow/_parquet.pyx", line 721, in pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Arrow error: Capacity error: BinaryArray cannot contain more than 2147483646 bytes, have 2147483685
> Arrow error: Capacity error: BinaryArray cannot contain more than 2147483646 bytes, have 2147483685
> {code}
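>
> The limit in the error is Arrow's BinaryArray capacity: a single array's
> string data cannot exceed 2147483646 bytes (2^31 - 2, since offsets are
> 32-bit), and this file's text columns total slightly more. Below is a
> workaround sketch, untested against 0.11.1, assuming the text data can be
> spread across several row groups; the file name {{export_chunked.parq}} and
> the row-group size of one million rows are illustrative, not from the
> original report:
>
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> # Rewrite with explicit, smaller row groups so that no single column
> # chunk's string data approaches the ~2 GiB BinaryArray limit.
> table = pa.Table.from_pandas(source_df)
> pq.write_table(table, 'export_chunked.parq', compression='gzip',
>                row_group_size=1_000_000)
>
> # Read the file back one row group at a time; concat_tables keeps each
> # row group as a separate chunk, so no single BinaryArray has to hold
> # all of the string data at once.
> pf = pq.ParquetFile('export_chunked.parq')
> loaded = pa.concat_tables(
>     [pf.read_row_group(i) for i in range(pf.num_row_groups)]
> ).to_pandas()
> {code}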
>
> One would expect PyArrow to be able to read back any file it writes
> successfully. Fortunately, the {{fastparquet}} library reads this file
> without trouble (see the sketch below), so no data was lost, but the failed
> round trip was a surprise.
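>
> For reference, pandas can be pointed at fastparquet explicitly; a minimal
> sketch, assuming fastparquet is installed:
>
> {code:python}
> import pandas as pd
>
> # Read with the fastparquet engine instead of the default pyarrow engine.
> loaded_df = pd.read_parquet('export.parq', engine='fastparquet')
> {code}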