[https://issues.apache.org/jira/browse/ARROW-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17661022#comment-17661022]
Rok Mihevc commented on ARROW-3999:
-----------------------------------
This issue has been migrated to [issue #20601|https://github.com/apache/arrow/issues/20601]
on GitHub. Please see the [migration documentation|https://github.com/apache/arrow/issues/14542]
for further details.
> [Python] Can't read large file that pyarrow wrote
> -------------------------------------------------
>
> Key: ARROW-3999
> URL: https://issues.apache.org/jira/browse/ARROW-3999
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.11.1
> Environment: OS: OSX High Sierra 10.13.6
> Python: 3.7.0
> PyArrow: 0.11.1
> Pandas: 0.23.4
> Reporter: Diego Argueta
> Priority: Major
>
> I loaded a large Pandas DataFrame from a CSV and successfully wrote it to a
> Parquet file using the DataFrame's {{to_parquet}} method. However, reading
> that same file back results in an exception. The DataFrame consists of about
> 32 million rows with seven columns; four are ASCII text and three are
> booleans.
>
> {code:python}
> >>> source_df.shape
> (32070402, 7)
> >>> source_df.dtypes
> Url Source            object
> Url Destination       object
> Anchor text           object
> Follow / No-Follow    object
> Link No-Follow          bool
> Meta No-Follow          bool
> Robot No-Follow         bool
> dtype: object
> >>> source_df.to_parquet('export.parq', compression='gzip', use_deprecated_int96_timestamps=True)
> >>> loaded_df = pd.read_parquet('export.parq')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pandas/io/parquet.py", line 288, in read_parquet
>     return impl.read(path, columns=columns, **kwargs)
>   File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pandas/io/parquet.py", line 131, in read
>     **kwargs).to_pandas()
>   File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 1074, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/filesystem.py", line 184, in read_parquet
>     use_pandas_metadata=use_pandas_metadata)
>   File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 943, in read
>     use_pandas_metadata=use_pandas_metadata)
>   File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 500, in read
>     table = reader.read(**options)
>   File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 187, in read
>     use_threads=use_threads)
>   File "pyarrow/_parquet.pyx", line 721, in pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Arrow error: Capacity error: BinaryArray cannot contain more than 2147483646 bytes, have 2147483685
> Arrow error: Capacity error: BinaryArray cannot contain more than 2147483646 bytes, have 2147483685
> {code}
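>
> The limit in the error is Arrow's BinaryArray capacity: a single array's
> string data cannot exceed 2147483646 bytes (2^31 - 2, since offsets are
> 32-bit), and this file's text columns total slightly more. Below is a
> workaround sketch, untested against 0.11.1, assuming the text data can be
> spread across several row groups; the file name {{export_chunked.parq}} and
> the row-group size of one million rows are illustrative, not from the
> original report:
>
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> # Rewrite with explicit, smaller row groups so that no single column
> # chunk's string data approaches the ~2 GiB BinaryArray limit.
> table = pa.Table.from_pandas(source_df)
> pq.write_table(table, 'export_chunked.parq', compression='gzip',
>                row_group_size=1_000_000)
>
> # Read the file back one row group at a time; concat_tables keeps each
> # row group as a separate chunk, so no single BinaryArray has to hold
> # all of the string data at once.
> pf = pq.ParquetFile('export_chunked.parq')
> loaded = pa.concat_tables(
>     [pf.read_row_group(i) for i in range(pf.num_row_groups)]
> ).to_pandas()
> {code}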
>
> One would expect PyArrow to be able to read back any file it writes
> successfully. Fortunately, the {{fastparquet}} library reads this file
> without trouble (see the sketch below), so no data was lost, but the failed
> round trip was a surprise.
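>
> For reference, pandas can be pointed at fastparquet explicitly; a minimal
> sketch, assuming fastparquet is installed:
>
> {code:python}
> import pandas as pd
>
> # Read with the fastparquet engine instead of the default pyarrow engine.
> loaded_df = pd.read_parquet('export.parq', engine='fastparquet')
> {code}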