[ https://issues.apache.org/jira/browse/ARROW-9818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche closed ARROW-9818.
----------------------------------------
Resolution: Cannot Reproduce
> [Python] Obscure C++ Error when Calling to_pandas on a RecordBatch
> ------------------------------------------------------------------
>
> Key: ARROW-9818
> URL: https://issues.apache.org/jira/browse/ARROW-9818
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 1.0.0
> Environment: AWS Lambda with pyarrow 1.0.0
> Reporter: Nolo Ogbirner
> Priority: Critical
>
> I'm using pyarrow to stream a CSV from an HTTP input and converting each
> RecordBatch to a pandas DataFrame for manipulation. For testing, I'm using
> the NYPD Motor Vehicle Collisions open source dataset. However, for any file
> larger than 5MB (e.g. 240MB or 1GB), my code, which runs in an AWS Lambda,
> fails with a RuntimeError because of
> terminate called after throwing an instance of 'std::logic_error'
> what(): basic_string::_S_construct null not valid
> after calling to_pandas() on the first batch. Why is this happening? How can
> I fix it? This first happened when 7 of the 28 columns were inferred to be of
> type null, so I set strings_can_be_null=True on my ConvertOptions for CSV
> reading and provided a schema that forced the null columns to be strings.
> This didn't work. I suspect it has something to do with the size of the
> file, but I am unsure.