[ 
https://issues.apache.org/jira/browse/ARROW-5419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-5419.
------------------------------------
    Resolution: Fixed

Issue resolved by pull request 4396
[https://github.com/apache/arrow/pull/4396]

> [C++] CSV strings_can_be_null option doesn't respect all null_values
> --------------------------------------------------------------------
>
>                 Key: ARROW-5419
>                 URL: https://issues.apache.org/jira/browse/ARROW-5419
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>         Environment: Python 3.6.8
> PyArrow 0.13.1.dev225+g184b8deb
> NumPy 1.16.3
> Pandas 0.24.2
>            Reporter: Dennis Waldron
>            Assignee: Antoine Pitrou
>            Priority: Minor
>              Labels: csv, pull-request-available
>             Fix For: 0.14.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Relates to ARROW-5195 and [https://github.com/apache/arrow/issues/4184]
> I was testing the new *strings_can_be_null* ConvertOption (built from git 
> 184b8deb651c6f6308c0fa2a595f5a40f5da8ce8) with the CSV reader and noticed 
> that, when the option is enabled, an empty string is not converted to NULL 
> even though '' is in the default null_values list 
> ([https://github.com/apache/arrow/blob/f7ef65e5fc367f1f5649dfcea0754e413fcca394/cpp/src/arrow/csv/options.cc#L28]):
> {code:java}
> options.null_values = {"", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN",
> "-NaN", "-nan", "1.#IND", "1.#QNAN", "N/A", "NA",
> "NULL", "NaN", "n/a", "nan", "null"};
> {code}
> Given that the *strings_can_be_null* option was added to expose the same NULL 
> processing for strings as *pandas.read_csv*, I believe that it should also 
> treat empty strings as NULL.
> In Pandas:
> {code:java}
> import io
> import pandas as pd
>
> content = b"a,b\n1,null\n2,\n3,test"
> df = pd.read_csv(io.BytesIO(content))
> print(df)
> # Output: both "null" and the empty field become NaN
> #    a     b
> # 0  1   NaN
> # 1  2   NaN
> # 2  3  test
> {code}
> In PyArrow:
> {code:java}
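> import pyarrow.csv as pc
>
> # Reuses io and content from the Pandas example above.
> convert_options = pc.ConvertOptions(strings_can_be_null=True)
> table = pc.read_csv(io.BytesIO(content), convert_options=convert_options)
> print(table.to_pydict())
> # Output: "null" becomes None, but the empty field stays ''
> # OrderedDict([('a', [1, 2, 3]), ('b', [None, '', 'test'])])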
> {code}
>  
>  
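
For reference, the behavior requested in the report can be sketched in plain Python. This is only an illustration using the standard library, not Arrow's implementation: the helper name read_string_columns and its parameters are hypothetical, and no type inference is done, so column a stays as strings. The point is that every spelling in null_values, including the empty string, maps to None once strings_can_be_null is enabled.

```python
import csv
import io

# Default null spellings quoted in the issue (cpp/src/arrow/csv/options.cc).
NULL_VALUES = frozenset([
    "", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN",
    "-NaN", "-nan", "1.#IND", "1.#QNAN", "N/A", "NA",
    "NULL", "NaN", "n/a", "nan", "null",
])


def read_string_columns(data, strings_can_be_null=False, null_values=NULL_VALUES):
    """Hypothetical helper: parse CSV bytes into {column: [str or None]}."""
    rows = csv.reader(io.StringIO(data.decode("utf-8")))
    header = next(rows)
    columns = {name: [] for name in header}
    for row in rows:
        for name, cell in zip(header, row):
            # Every spelling in null_values, including "", maps to None.
            is_null = strings_can_be_null and cell in null_values
            columns[name].append(None if is_null else cell)
    return columns


content = b"a,b\n1,null\n2,\n3,test"
result = read_string_columns(content, strings_can_be_null=True)
print(result["b"])
```

With the option left at its default of False in this sketch, the same input keeps 'null' and '' as ordinary strings in column b.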



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
