[ https://issues.apache.org/jira/browse/ARROW-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907129#comment-16907129 ]

Igor Yastrebov commented on ARROW-3762:
---------------------------------------

It seems the issue comes from pandas boolean indexing. CSV file with the data (~400,000,000 lines; I couldn't reduce it further without losing the error): [Google Drive|https://drive.google.com/file/d/1QMwxl4tgo8W-wOL1ih4nXV2vzRq2NaVW/view?usp=sharing]

Reproduction code:
{code:python}
>>> import pandas as pd
>>> tst = pd.read_csv('test.csv', dtype={'col1': 'float32', 'col2': 'str'})
>>> tst = tst[~tst.col1.isnull()]
>>> tst.to_parquet('test.parquet', engine='pyarrow', index=False)
>>> tst = pd.read_parquet('test.parquet')
{code}
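The workaround described in the issue below (reading each row group separately and concatenating the pieces) should also apply to the file produced by this repro. A minimal self-contained sketch; a small stand-in Parquet file is generated here since the original CSV is ~400 M lines, and the file name and row-group size are placeholders:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# Stand-in for the 'test.parquet' produced by the repro above;
# row_group_size=2 forces multiple row groups in a tiny file.
table = pa.Table.from_pydict({'col1': [1.0, 2.0, 3.0, 4.0]})
pq.write_table(table, 'test.parquet', row_group_size=2)

pf = pq.ParquetFile('test.parquet')
# Read each row group individually, then concatenate; each piece stays
# below the 2 GiB BinaryArray capacity limit that pf.read() can hit.
pieces = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
combined = pa.concat_tables(pieces)
df = combined.to_pandas()
{code}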
Conda environment reproduction:
{code:bash}
conda install python=3.7 pandas=0.25.0 pyarrow=0.14.1 -c conda-forge
{code}

> [C++] Parquet arrow::Table reads error when overflowing capacity of 
> BinaryArray
> -------------------------------------------------------------------------------
>
>                 Key: ARROW-3762
>                 URL: https://issues.apache.org/jira/browse/ARROW-3762
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>            Reporter: Chris Ellison
>            Assignee: Benjamin Kietzman
>            Priority: Major
>              Labels: parquet, pull-request-available
>             Fix For: 0.14.0, 0.15.0
>
>          Time Spent: 8h 10m
>  Remaining Estimate: 0h
>
> When reading a Parquet file containing more than 2 GiB of binary data, we get an ArrowIOError because the reader does not split the result into chunked arrays. Reading each row group individually and then concatenating the tables works, however.
>  
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> x = pa.array(list('1' * 2**30))
> demo = 'demo.parquet'
> def scenario():
>     t = pa.Table.from_arrays([x], ['x'])
>     writer = pq.ParquetWriter(demo, t.schema)
>     for i in range(2):
>         writer.write_table(t)
>     writer.close()
>     pf = pq.ParquetFile(demo)
>     # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot
>     # contain more than 2147483646 bytes, have 2147483647
>     t2 = pf.read()
>     # Works, but note, there are 32 row groups, not 2 as suggested by:
>     # https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
>     tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
>     t3 = pa.concat_tables(tables)
> scenario()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)