[ https://issues.apache.org/jira/browse/ARROW-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Kietzman updated ARROW-3762:
-------------------------------------
    Description: 
When reading a Parquet file that contains more than 2 GiB of binary data in a
single column, an ArrowIOError is raised because the read path does not split
the result into chunked arrays. Reading each row group individually and then
concatenating the tables works, however.

 
{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


x = pa.array(list('1' * 2**30))

demo = 'demo.parquet'


def scenario():
    t = pa.Table.from_arrays([x], ['x'])
    writer = pq.ParquetWriter(demo, t.schema)
    for i in range(2):
        writer.write_table(t)
    writer.close()

    pf = pq.ParquetFile(demo)

    # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot
    # contain more than 2147483646 bytes, have 2147483647
    t2 = pf.read()

    # Works, but note there are 32 row groups, not 2 as suggested by:
    # https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
    tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
    t3 = pa.concat_tables(tables)

scenario()
{code}
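
For reference, here is a minimal verification sketch (not part of the original report). It assumes pyarrow >= 0.14.0, where this issue is marked fixed, and a release recent enough that Table.column() returns a ChunkedArray; under those assumptions the read should succeed and the oversized binary column should come back split across multiple chunks rather than as a single BinaryArray.

{code:python}
# Hypothetical check of the fixed behaviour; assumes pyarrow >= 0.14.0 and
# that Table.column() returns a ChunkedArray (older releases returned a
# Column wrapper). 'demo.parquet' is the file written by scenario() above.
import pyarrow.parquet as pq

pf = pq.ParquetFile('demo.parquet')
t2 = pf.read()            # should no longer raise ArrowIOError
col = t2.column(0)        # the 'x' column, holding > 2 GiB of binary data
print(col.num_chunks)     # expected > 1, since a single BinaryArray cannot
                          # hold more than ~2 GiB of values
{code}

On the row group count: in current releases ParquetWriter.write_table also accepts a row_group_size argument, which caps the rows written per row group and gives explicit control over how many row groups each write_table call produces.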

  was:
When reading a parquet file with binary data > 2 GiB, we get an ArrowIOError 
due to it not creating chunked arrays. Reading each row group individually and 
then concatenating the tables works, however.

 
{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


x = pa.array(list('1' * 2**30))

demo = 'demo.parquet'


def scenario():
    t = pa.Table.from_arrays([x], ['x'])
    writer = pq.ParquetWriter(demo, t.schema)
    for i in range(2):
        writer.write_table(t)
    writer.close()

    pf = pq.ParquetFile(demo)

    # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot
    # contain more than 2147483646 bytes, have 2147483647
    t2 = pf.read()

    # Works, but note, there are 32 row groups, not 2 as suggested by:
    # https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
    tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
    t3 = pa.concat_tables(tables)

scenario()
{code}


> [C++] Parquet arrow::Table reads error when overflowing capacity of 
> BinaryArray
> -------------------------------------------------------------------------------
>
>                 Key: ARROW-3762
>                 URL: https://issues.apache.org/jira/browse/ARROW-3762
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>            Reporter: Chris Ellison
>            Assignee: Benjamin Kietzman
>            Priority: Major
>              Labels: parquet, pull-request-available
>             Fix For: 0.14.0, 0.15.0
>
>          Time Spent: 8h 10m
>  Remaining Estimate: 0h
>
> When reading a Parquet file that contains more than 2 GiB of binary data in a
> single column, an ArrowIOError is raised because the read path does not split
> the result into chunked arrays. Reading each row group individually and then
> concatenating the tables works, however.
>  
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> x = pa.array(list('1' * 2**30))
> demo = 'demo.parquet'
> def scenario():
>     t = pa.Table.from_arrays([x], ['x'])
>     writer = pq.ParquetWriter(demo, t.schema)
>     for i in range(2):
>         writer.write_table(t)
>     writer.close()
>     pf = pq.ParquetFile(demo)
>     # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot
>     # contain more than 2147483646 bytes, have 2147483647
>     t2 = pf.read()
>     # Works, but note there are 32 row groups, not 2 as suggested by:
>     # https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
>     tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
>     t3 = pa.concat_tables(tables)
> scenario()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
