Sep Dehpour created ARROW-9474:
----------------------------------

             Summary: Column type inference in read_csv vs. open_csv. CSV conversion error to null.
                 Key: ARROW-9474
                 URL: https://issues.apache.org/jira/browse/ARROW-9474
             Project: Apache Arrow
          Issue Type: Bug
            Reporter: Sep Dehpour

The open_csv streaming reader does not adjust an inferred column type based on data seen in later blocks. For example, if a column contains only null values in the first few blocks read by open_csv, the column is inferred as Null type. When PyArrow then iterates over later blocks and encounters non-null values in that column, it fails with an error.

Example error:
{code:java}
pyarrow.lib.ArrowInvalid: In CSV column #44: CSV conversion error to null: invalid value '-176400'
{code}

The problem goes away if a read_options with a very large block size is passed to open_csv, but that defeats the purpose of streaming instead of using read_csv.

System info: PyArrow 0.17.1, Mac OS Catalina, Python 3.7.4