[ https://issues.apache.org/jira/browse/ARROW-9818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nolo Ogbirner updated ARROW-9818:
---------------------------------
    Description: 
I'm using PyArrow to stream a CSV from an input over HTTP and then converting
each RecordBatch to a pandas DataFrame for manipulation. For testing, I'm using
the NYPD Motor Vehicle Collisions open dataset. However, for any file larger
than the 5MB one (e.g. 240MB, 1GB), my code, which runs in an AWS Lambda, fails
with a RuntimeError because of

terminate called after throwing an instance of 'std::logic_error'
 what(): basic_string::_S_construct null not valid

after calling to_pandas() on the first batch. Why is this happening? How can I
fix it? This happened when 7 of the 28 columns were inferred to be of type
null, so I set strings_can_be_null=True on my ReadOptions for CSV reading and
provided a schema that forced the null columns to be strings. This didn't work.
I suspect it has something to do with the size of the file, but am unsure.

  was:
I'm using PyArrow to stream a CSV from an input over HTTP and then converting
each RecordBatch to a pandas DataFrame for manipulation. For testing, I'm using
the NYPD Motor Vehicle Collisions open dataset. However, for any file larger
than the 5MB one (e.g. 240MB, 1GB), my code, which runs in an AWS Lambda, fails
with a RuntimeError because of

terminate called after throwing an instance of 'std::logic_error'
 what(): basic_string::_S_construct null not valid

after calling to_pandas() on the first batch. Why is this happening? How can I
fix it? This happened when 7 of the 28 columns were inferred to be of type
null, so I set strings_can_be_null=True on my ReadOptions for CSV reading and
provided a schema that forced the null columns to be strings. This didn't work.


> Obscure C++ Error when Calling to_pandas on a RecordBatch
> ---------------------------------------------------------
>
>                 Key: ARROW-9818
>                 URL: https://issues.apache.org/jira/browse/ARROW-9818
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.0
>         Environment: AWS Lambda with pyarrow 1.0.0
>            Reporter: Nolo Ogbirner
>            Priority: Critical
>
> I'm using PyArrow to stream a CSV from an input over HTTP and then converting
> each RecordBatch to a pandas DataFrame for manipulation. For testing, I'm
> using the NYPD Motor Vehicle Collisions open dataset. However, for any file
> larger than the 5MB one (e.g. 240MB, 1GB), my code, which runs in an AWS
> Lambda, fails with a RuntimeError because of
> terminate called after throwing an instance of 'std::logic_error'
>  what(): basic_string::_S_construct null not valid
> after calling to_pandas() on the first batch. Why is this happening? How can
> I fix it? This happened when 7 of the 28 columns were inferred to be of type
> null, so I set strings_can_be_null=True on my ReadOptions for CSV reading and
> provided a schema that forced the null columns to be strings. This didn't
> work. I suspect it has something to do with the size of the file, but am
> unsure.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
