Frédérique Vanneste created ARROW-3139:
------------------------------------------
Summary: ArrowIOError: Arrow error: Capacity error during read
Key: ARROW-3139
URL: https://issues.apache.org/jira/browse/ARROW-3139
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.10.0
Environment: pandas=0.23.1=py36h637b7d7_0
pyarrow==0.10.0
Reporter: Frédérique Vanneste
My assumption: the problem is caused by a large object column containing
strings up to 27 characters long, so the column holds far more than 2GB of
string data in total (a chunking issue).
Looks similar to
https://issues.apache.org/jira/browse/ARROW-2227?focusedCommentId=16379574&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16379574
Code
* basket_plateau = pq.read_table("basket_plateau.parquet")
* basket_plateau = pd.read_parquet("basket_plateau.parquet")
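Either line reproduces the failure; a self-contained version with the imports
the two snippets assume (file name as in the report):

    import pandas as pd
    import pyarrow.parquet as pq

    # Reading via pyarrow directly hits the error...
    basket_plateau = pq.read_table("basket_plateau.parquet")

    # ...and reading via pandas fails the same way, since pandas
    # delegates the parquet read to pyarrow under the hood.
    basket_plateau = pd.read_parquet("basket_plateau.parquet")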
Error produced
* ArrowIOError: Arrow error: Capacity error: BinaryArray cannot contain more
than 2147483646 bytes, have 2147483655
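For context, my reading of where that constant comes from (Arrow internals,
not stated in the error itself): BinaryArray stores its value offsets as
signed 32-bit integers, so a single array is capped at INT32_MAX - 1 bytes:

    # The cap in the error message is INT32_MAX - 1, i.e. just under 2 GiB.
    assert 2147483646 == 2**31 - 2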
Dataset
* Pandas dataframe (pandas=0.23.1=py36h637b7d7_0)
* 2.7 billion records, 4 columns (int64/object/datetime64/float64)
* approx. 90GB in memory (see the size arithmetic after this list)
* example of object col values: "Fresh Vegetables", "Alcohol Beers", ...
(think food retail categories)
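A rough estimate of why the object column alone overflows a single array; the
row count is from the figures above and the cap from the error message, while
the average string length is my assumption:

    rows = 2_700_000_000          # ~2.7 billion records
    avg_len = 16                  # assumed average length (strings run up to 27 chars)
    total_bytes = rows * avg_len  # ~43 GB of raw string data in the object column
    cap = 2147483646              # per-array byte cap from the error message
    print(total_bytes / cap)      # roughly 20x over the limit, so chunking is required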
History of the bug:
* was using an older version of pyarrow
* tried writing the dataset to disk (parquet) and failed
* stumbled on https://issues.apache.org/jira/browse/ARROW-2227
* upgraded to 0.10.0
* tried writing the dataset to disk (parquet) and succeeded
* tried reading the dataset back and failed (possible workaround sketch below)
* looks like a similar case to:
https://issues.apache.org/jira/browse/ARROW-2227?focusedCommentId=16379574&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16379574
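A possible workaround sketch until the read path handles this, assuming the
file was written with multiple row groups: reading row groups one at a time
and concatenating keeps each column split across several chunks instead of
one oversized BinaryArray (pa.concat_tables does not merge chunks):

    import pyarrow as pa
    import pyarrow.parquet as pq

    pf = pq.ParquetFile("basket_plateau.parquet")
    # Read one row group at a time so no single BinaryArray has to hold
    # the whole multi-GB object column.
    pieces = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
    basket_plateau = pa.concat_tables(pieces)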