[ https://issues.apache.org/jira/browse/ARROW-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Diego Argueta updated ARROW-3999:
---------------------------------
    Description: 
I loaded a large Pandas DataFrame from a CSV and successfully wrote it to a 
Parquet file using the DataFrame's {{to_parquet}} method. However, reading that 
same file back results in an exception. The DataFrame consists of about 32 
million rows with seven columns; four are ASCII text and three are booleans.

 
{code:python}
>>> source_df.shape
(32070402, 7)

>>> source_df.dtypes
Url Source            object
Url Destination       object
Anchor text           object
Follow / No-Follow    object
Link No-Follow          bool
Meta No-Follow          bool
Robot No-Follow         bool
dtype: object

>>> source_df.to_parquet('export.parq', compression='gzip',
...                      use_deprecated_int96_timestamps=True)

>>> loaded_df = pd.read_parquet('export.parq')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pandas/io/parquet.py", line 288, in read_parquet
    return impl.read(path, columns=columns, **kwargs)
  File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pandas/io/parquet.py", line 131, in read
    **kwargs).to_pandas()
  File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 1074, in read_table
    use_pandas_metadata=use_pandas_metadata)
  File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/filesystem.py", line 184, in read_parquet
    use_pandas_metadata=use_pandas_metadata)
  File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 943, in read
    use_pandas_metadata=use_pandas_metadata)
  File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 500, in read
    table = reader.read(**options)
  File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 187, in read
    use_threads=use_threads)
  File "pyarrow/_parquet.pyx", line 721, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Arrow error: Capacity error: BinaryArray cannot contain more than 2147483646 bytes, have 2147483685

Arrow error: Capacity error: BinaryArray cannot contain more than 2147483646 bytes, have 2147483685
{code}

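As far as I can tell, the limit in the error message, 2147483646 bytes, is 2^31 - 2: Arrow's BinaryArray stores its offsets as signed 32-bit integers, so a single contiguous array of variable-length data tops out just under 2 GiB. As a rough way to see which column crosses that line before writing (my own sketch, not something the report depends on; {{source_df}} is the frame above and LIMIT is the value quoted in the error):

{code:python}
# Estimate the UTF-8 payload of each string column to see which one
# exceeds Arrow's BinaryArray capacity of 2**31 - 2 bytes.
LIMIT = 2**31 - 2  # 2147483646, the value quoted in the error

for col in source_df.select_dtypes(include='object'):
    nbytes = source_df[col].astype(str).str.encode('utf-8').str.len().sum()
    print(col, nbytes, 'over the limit' if nbytes > LIMIT else 'ok')
{code}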
One would expect that if PyArrow can write a file successfully, it can read it 
back as well. Fortunately the {{fastparquet}} library has no problem reading 
this file, so we didn't lose any data, but the roundtripping problem was a bit 
of a surprise.
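In the meantime, two things appear to sidestep the limit on my end (again a sketch, assuming pandas' pyarrow engine forwards {{row_group_size}} through to {{pyarrow.parquet.write_table}}): write the file with smaller row groups so no single column chunk approaches 2 GiB, and read it back one row group at a time instead of materializing each column as a single array. The 4,000,000-row group size below is an arbitrary illustration, not a tuned value.

{code:python}
import pandas as pd
import pyarrow.parquet as pq

# Write with explicit, smaller row groups; row_group_size is forwarded to
# pyarrow.parquet.write_table by the pyarrow engine.
source_df.to_parquet('export.parq', engine='pyarrow', compression='gzip',
                     row_group_size=4_000_000)

# Read back one row group at a time so no single BinaryArray has to hold
# an entire column's worth of string data.
pf = pq.ParquetFile('export.parq')
pieces = [pf.read_row_group(i).to_pandas() for i in range(pf.num_row_groups)]
loaded_df = pd.concat(pieces, ignore_index=True)
{code}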


> [Python] Can't read large file that pyarrow wrote
> -------------------------------------------------
>
>                 Key: ARROW-3999
>                 URL: https://issues.apache.org/jira/browse/ARROW-3999
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.11.1
>         Environment: OS: OSX High Sierra 10.13.6
> Python: 3.7.0
> PyArrow: 0.11.1
> Pandas: 0.23.4
>            Reporter: Diego Argueta
>            Priority: Major



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
