[
https://issues.apache.org/jira/browse/ARROW-9818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nolo Ogbirner updated ARROW-9818:
---------------------------------
Description:
I'm using PyArrow to stream a CSV from an input over HTTP and then converting
each RecordBatch to a pandas DataFrame for manipulation. For testing, I'm using
the NYPD Motor Vehicle Collisions open dataset. However, for anything larger
than the 5MB file (e.g. 240MB or 1GB), my code, which runs in an AWS Lambda,
fails with a RuntimeError because of
terminate called after throwing an instance of 'std::logic_error'
what(): basic_string::_S_construct null not valid
after calling to_pandas() on the first batch. Why is this happening? How can I
fix it? This happened when 7 of the 28 columns were inferred to be of type
null, so I instead set strings_can_be_null=True on my ReadOptions for CSV
reading and provided a schema that forced the null columns to be strings. This
didn't work.
was:
I'm using PyArrow to stream a CSV from an input over HTTP and then converting
each RecordBatch to a pandas DataFrame for manipulation. For testing, I'm using
the NYPD Motor Vehicle Collisions open dataset. However, for anything larger
than the 5MB file (e.g. 240MB or 1GB), my code, which runs in an AWS Lambda,
fails with a RuntimeError because of
terminate called after throwing an instance of 'std::logic_error'
what(): basic_string::_S_construct null not valid
after calling to_pandas() on the first batch. Why is this happening? How can I
fix it?
> Obscure C++ Error when Calling to_pandas on a RecordBatch
> ---------------------------------------------------------
>
> Key: ARROW-9818
> URL: https://issues.apache.org/jira/browse/ARROW-9818
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 1.0.0
> Environment: AWS Lambda with pyarrow 1.0.0
> Reporter: Nolo Ogbirner
> Priority: Critical