[ 
https://issues.apache.org/jira/browse/ARROW-9818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche closed ARROW-9818.
----------------------------------------
    Resolution: Cannot Reproduce

> [Python] Obscure C++ Error when Calling to_pandas on a RecordBatch
> ------------------------------------------------------------------
>
>                 Key: ARROW-9818
>                 URL: https://issues.apache.org/jira/browse/ARROW-9818
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.0
>         Environment: AWS Lambda with pyarrow 1.0.0
>            Reporter: Nolo Ogbirner
>            Priority: Critical
>
> I'm using pyarrow to stream a CSV file from an input over HTTP, converting 
> each RecordBatch to a pandas DataFrame for manipulation. For testing, I'm 
> using the NYPD Motor Vehicle Collisions open-source dataset. However, for 
> any file larger than 5MB (e.g. 240MB or 1GB), my code, which runs in an 
> AWS Lambda, fails with a RuntimeError:
> terminate called after throwing an instance of 'std::logic_error'
>  what(): basic_string::_S_construct null not valid
> The failure occurs after calling to_pandas() on the first batch. Why is this 
> happening? How can I fix it? It first happened when some 7 of the 28 columns 
> were inferred to be of type null, so I set strings_can_be_null=True on my 
> ReadOptions for CSV reading and provided a schema that forced the null 
> columns to be strings. This didn't work. I suspect it has something to do 
> with the size of the file, but I am unsure.
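
For reference, the setup described above can be sketched roughly as follows.
This is only an illustration of the reported workflow, not a reproduction of
the crash and not a fix for it. It assumes that the streaming reader
(pyarrow.csv.open_csv) infers column types from the first block, so a column
that is empty throughout that block comes out as type null. Note that in the
pyarrow API, strings_can_be_null and column_types are parameters of
ConvertOptions rather than ReadOptions. The helper names all_empty_columns and
iter_frames, the sample_rows default, and the block_size value are
hypothetical choices made for this sketch.

```python
import csv
import io


def all_empty_columns(csv_text, sample_rows=1000):
    """Return the header names whose value is empty in every sampled row.

    Rough stand-in for "columns that would be inferred as null" when type
    inference only looks at the first block of the file (assumption).
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    empty = set(range(len(header)))
    for i, row in enumerate(reader):
        if i >= sample_rows:
            break
        # Keep only the columns that are still empty (or missing) in this row.
        empty = {j for j in empty if j >= len(row) or row[j] == ""}
    return [header[j] for j in sorted(empty)]


def iter_frames(source, string_columns, block_size=1 << 20):
    """Stream a CSV and yield one pandas DataFrame per RecordBatch.

    `source` is a path or a binary file-like object. The columns named in
    `string_columns` are forced to the string type instead of being inferred.
    """
    # Imported lazily so the helper above works without pyarrow installed.
    import pyarrow as pa
    import pyarrow.csv as pacsv

    convert_options = pacsv.ConvertOptions(
        column_types={name: pa.string() for name in string_columns},
        strings_can_be_null=True,
    )
    read_options = pacsv.ReadOptions(block_size=block_size)
    reader = pacsv.open_csv(
        source,
        read_options=read_options,
        convert_options=convert_options,
    )
    for batch in reader:
        yield batch.to_pandas()
```

A caller might pass the first chunk of the download (decoded to text) to
all_empty_columns, then hand the resulting names to iter_frames together with
the file-like HTTP response body.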



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
