[
https://issues.apache.org/jira/browse/ARROW-5419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rok Mihevc updated ARROW-5419:
------------------------------
External issue URL: https://github.com/apache/arrow/issues/21872
> [C++] CSV strings_can_be_null option doesn't respect all null_values
> --------------------------------------------------------------------
>
> Key: ARROW-5419
> URL: https://issues.apache.org/jira/browse/ARROW-5419
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Environment: Python 3.6.8
> PyArrow 0.13.1.dev225+g184b8deb
> NumPy 1.16.3
> Pandas 0.24.2
> Reporter: Dennis Waldron
> Assignee: Antoine Pitrou
> Priority: Minor
> Labels: csv, pull-request-available
> Fix For: 0.14.0
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> Relates to ARROW-5195 and [https://github.com/apache/arrow/issues/4184]
> I was testing the new *strings_can_be_null* ConvertOption (built from git
> 184b8deb651c6f6308c0fa2a595f5a40f5da8ce8) with the CSV reader and noticed
> that, when the option is enabled, an empty string is not converted to NULL
> even though '' is in the default null_values list
> ([https://github.com/apache/arrow/blob/f7ef65e5fc367f1f5649dfcea0754e413fcca394/cpp/src/arrow/csv/options.cc#L28]):
> {code:cpp}
> options.null_values = {"", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN",
> "-NaN", "-nan", "1.#IND", "1.#QNAN", "N/A", "NA",
> "NULL", "NaN", "n/a", "nan", "null"};
> {code}
> Given that the *strings_can_be_null* option was added to expose the same NULL
> processing for strings as *pandas.read_csv*, I believe it should also treat
> empty strings as NULL.
> In Pandas:
> {code:python}
> import io
> import pandas as pd
>
> content = b"a,b\n1,null\n2,\n3,test"
> df = pd.read_csv(io.BytesIO(content))
> print(df)
> # Output:
> #    a     b
> # 0  1   NaN
> # 1  2   NaN
> # 2  3  test
> {code}
> In PyArrow:
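> {code:python}
> import io
> import pyarrow.csv as pc
> # content is the same bytes object as in the Pandas example
> convert_options = pc.ConvertOptions(strings_can_be_null=True)
> table = pc.read_csv(io.BytesIO(content), convert_options=convert_options)
> print(table.to_pydict())
> # Output (the empty string in the second row is not converted to None):
> # OrderedDict([('a', [1, 2, 3]), ('b', [None, '', 'test'])])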
> {code}
>
>
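For reference, the behavior requested in the report can be modeled in a few lines of plain Python. This is only a sketch of the expected semantics, not Arrow's implementation: the helper name convert_string_column and the constant DEFAULT_NULL_VALUES are hypothetical, and the set simply copies the default null_values list quoted in the issue.

```python
# Hypothetical pure-Python model of the semantics requested in the issue.
# It does not call Arrow; the names below are illustrative only.
DEFAULT_NULL_VALUES = frozenset({
    "", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN",
    "-NaN", "-nan", "1.#IND", "1.#QNAN", "N/A", "NA",
    "NULL", "NaN", "n/a", "nan", "null",
})


def convert_string_column(cells, strings_can_be_null=False,
                          null_values=DEFAULT_NULL_VALUES):
    """Return the cells of a string column, mapping null spellings to None.

    When strings_can_be_null is False, every cell is kept as a string.
    When it is True, any cell found in null_values (including the empty
    string) becomes None.
    """
    if not strings_can_be_null:
        return list(cells)
    return [None if cell in null_values else cell for cell in cells]


# Column "b" from the CSV content in the report: "null", "" and "test".
column_b = ["null", "", "test"]
print(convert_string_column(column_b, strings_can_be_null=True))
```

Under this model the empty string in the second row becomes None, which matches the Pandas output quoted above, whereas the PyArrow build in the report returned '' for that cell.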