[
https://issues.apache.org/jira/browse/ARROW-3139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wes McKinney closed ARROW-3139.
-------------------------------
Resolution: Duplicate
Duplicate of ARROW-3762 (formerly PARQUET-1239).
> [Python] ArrowIOError: Arrow error: Capacity error during read
> --------------------------------------------------------------
>
> Key: ARROW-3139
> URL: https://issues.apache.org/jira/browse/ARROW-3139
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.10.0
> Environment: pandas=0.23.1=py36h637b7d7_0
> pyarrow==0.10.0
> Reporter: Frédérique Vanneste
> Priority: Major
>
> My assumption: the problem is caused by a large object column containing
> strings up to 27 characters long. Across the whole dataset that column holds
> far more than 2GB of string data, so this looks like a chunking issue.
> It looks similar to
> https://issues.apache.org/jira/browse/ARROW-2227?focusedCommentId=16379574&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16379574
>
> Code
> * basket_plateau = pq.read_table("basket_plateau.parquet")
> * basket_plateau = pd.read_parquet("basket_plateau.parquet")
> Error produced
> * ArrowIOError: Arrow error: Capacity error: BinaryArray cannot contain more
> than 2147483646 bytes, have 2147483655
> Dataset
> * Pandas dataframe (pandas=0.23.1=py36h637b7d7_0)
> * 2.7 billion records, 4 columns (int64/object/datetime64/float64)
> * approx. 90GB in memory
> * example values of the object column: "Fresh Vegetables", "Alcohol Beers",
> ... (think food retail categories)
> History of the bug:
> * was using an older version of pyarrow
> * tried writing dataset to disk (parquet) and failed
> * stumbled on https://issues.apache.org/jira/browse/ARROW-2227
> * upgraded to 0.10
> * tried writing dataset to disk (parquet) and succeeded
> * tried reading dataset and failed
> * looks like a similar case to:
> https://issues.apache.org/jira/browse/ARROW-2227?focusedCommentId=16379574&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16379574
>
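
To make the reporter's size estimate concrete, here is a rough back-of-the-envelope sketch in Python. The limit and the row count are taken from the report above. The average of 16 bytes per value is only an assumption for illustration ("Fresh Vegetables" is 16 ASCII characters); the report does not state the real average. The helper name min_chunks_needed is hypothetical and not part of pyarrow.

```python
import math

# Limit quoted in the error message: a single BinaryArray cannot contain
# more than 2147483646 bytes of data.
BINARY_ARRAY_MAX_BYTES = 2_147_483_646  # equals 2**31 - 2


def min_chunks_needed(num_rows: int, avg_bytes_per_value: float) -> int:
    """Lower bound on how many chunks keep each chunk within the byte limit."""
    total_bytes = num_rows * avg_bytes_per_value
    return max(1, math.ceil(total_bytes / BINARY_ARRAY_MAX_BYTES))


# From the report: about 2.7 billion records.
NUM_ROWS = 2_700_000_000
# ASSUMPTION: average value size, based on the example "Fresh Vegetables".
ASSUMED_AVG_BYTES = 16

total = NUM_ROWS * ASSUMED_AVG_BYTES
chunks = min_chunks_needed(NUM_ROWS, ASSUMED_AVG_BYTES)
print(f"estimated string bytes: {total:,}")
print(f"minimum chunks needed:  {chunks}")
```

Under that assumed average, the object column alone holds roughly 43 GB of string data, about twenty times what one BinaryArray can hold, so the column can only be read if it is split into many chunks. This is consistent with the reporter's guess that chunking is involved, but it does not show where in the read path the limit is hit.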