Dennis Waldron created ARROW-5419:
-------------------------------------
Summary: [C++] CSV strings_can_be_null option doesn't respect all
null_values
Key: ARROW-5419
URL: https://issues.apache.org/jira/browse/ARROW-5419
Project: Apache Arrow
Issue Type: Bug
Components: C++, Python
Environment: Python 3.6.8
PyArrow 0.13.1.dev225+g184b8deb
NumPy 1.16.3
Pandas 0.24.2
Reporter: Dennis Waldron
Relates to https://issues.apache.org/jira/browse/ARROW-5195 and
[https://github.com/apache/arrow/issues/4184]
I was testing the new *strings_can_be_null* ConvertOption (built from git
184b8deb651c6f6308c0fa2a595f5a40f5da8ce8) with the CSV reader and noticed
that, when the option is enabled, an empty string is not converted to NULL,
even though '' is in the default null_values list
([https://github.com/apache/arrow/blob/f7ef65e5fc367f1f5649dfcea0754e413fcca394/cpp/src/arrow/csv/options.cc#L28]):
{code:cpp}
options.null_values = {"", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN",
"-NaN", "-nan", "1.#IND", "1.#QNAN", "N/A", "NA",
"NULL", "NaN", "n/a", "nan", "null"};
{code}
Given that the *strings_can_be_null* option was added to expose the same NULL
processing for strings as *pandas.read_csv*, I believe it should also treat
empty strings as NULL.
In Pandas:
{code:python}
import io
import pandas as pd

content = b"a,b\n1,null\n2,\n3,test"
df = pd.read_csv(io.BytesIO(content))
print(df)
# Output: both "null" and the empty field become NaN
#    a     b
# 0  1   NaN
# 1  2   NaN
# 2  3  test
{code}
In PyArrow:
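{code:python}
import io
import pyarrow.csv as pc
# content: same bytes as in the Pandas example
convert_options = pc.ConvertOptions(strings_can_be_null=True)
table = pc.read_csv(io.BytesIO(content), convert_options=convert_options)
print(table.to_pydict())
# Observed: the empty field stays '' instead of becoming None
# OrderedDict([('a', [1, 2, 3]), ('b', [None, '', 'test'])])
{code}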
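To make the expected behaviour concrete, here is a minimal sketch in plain Python using only the standard library (no Arrow code involved). The helper name read_string_column is hypothetical and is not part of PyArrow; the null spellings are copied from the options.cc snippet quoted above. It only illustrates the semantics I would expect from strings_can_be_null=True, not how the C++ reader is or should be implemented.

```python
import csv
import io

# Default null spellings, copied from the options.cc snippet quoted above.
DEFAULT_NULL_VALUES = frozenset({
    "", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN",
    "-NaN", "-nan", "1.#IND", "1.#QNAN", "N/A", "NA",
    "NULL", "NaN", "n/a", "nan", "null",
})


def read_string_column(data, column, null_values=DEFAULT_NULL_VALUES):
    """Hypothetical helper: read one CSV column as strings, mapping every
    spelling in null_values (including the empty string) to None."""
    reader = csv.DictReader(io.StringIO(data.decode("utf-8")))
    return [None if row[column] in null_values else row[column]
            for row in reader]


content = b"a,b\n1,null\n2,\n3,test"
print(read_string_column(content, "b"))  # [None, None, 'test']
```

With the same input as the examples above, column b comes out as [None, None, 'test'], which matches what Pandas reports (NaN, NaN, test) and is what I would expect PyArrow to return.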