[ 
https://issues.apache.org/jira/browse/ARROW-9818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181936#comment-17181936
 ] 

Nolo Ogbirner commented on ARROW-9818:
--------------------------------------

 I will try to reproduce this locally ASAP. I have tried memory limits of 
512 MB and 1024 MB for the AWS Lambda function. The pyarrow version I'm using 
was built from source with S3 support but without Gandiva, so that the 
dependency fits in Lambda. It is then uploaded as an unzipped wheel. 

> [Python] Obscure C++ Error when Calling to_pandas on a RecordBatch
> ------------------------------------------------------------------
>
>                 Key: ARROW-9818
>                 URL: https://issues.apache.org/jira/browse/ARROW-9818
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.0
>         Environment: AWS Lambda with pyarrow 1.0.0
>            Reporter: Nolo Ogbirner
>            Priority: Critical
>
> I'm using pyarrow to stream a CSV from an input over HTTP, converting each 
> RecordBatch to a pandas DataFrame for manipulation. For testing, I'm using 
> the NYPD Motor Vehicle Collisions open source dataset. However, for anything 
> larger than the 5 MB file (e.g. the 240 MB and 1 GB files), my code, which 
> runs in an AWS Lambda, fails with a RuntimeError because of
> terminate called after throwing an instance of 'std::logic_error'
>  what(): basic_string::_S_construct null not valid
> after calling to_pandas() on the first batch. Why is this happening? How can 
> I fix it? This happened when some 7 of the 28 columns were inferred to be of 
> type null, so I instead set strings_can_be_null=True on my ReadOptions for 
> CSV reading and provided a schema that forced the null columns to be strings. 
> This didn't work. I suspect it has something to do with the size of the file, 
> but am unsure.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
