[ 
https://issues.apache.org/jira/browse/ARROW-12959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook updated ARROW-12959:
-----------------------------
    Description: 
(This is the flip side of ARROW-12960.)

Currently the Arrow compute kernel {{is_null}} always treats {{NaN}} as a 
non-missing value, returning {{false}} at positions of the input datum with 
value {{NaN}}.

It would be helpful to be able to control this behavior with an option. The 
option could be named {{nan_is_null}} or something similar.  It would default 
to {{false}}, consistent with current behavior. When set to {{true}}, it should 
check if the input datum has a floating point data type, and if so, return 
{{true}} at positions where the input is {{NaN}}. If the input datum has some 
other type, the option should be silently ignored.
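
To make the proposed semantics concrete, here is a minimal standalone sketch in C++17. It does not use the Arrow API: {{IsNullOptionsSketch}}, {{NullableValues}}, and {{IsNullSketch}} are hypothetical names invented for illustration, and {{std::optional}} merely stands in for a nullable array slot. Only the option name {{nan_is_null}} and its default come from the proposal above.

```cpp
#include <cmath>
#include <optional>
#include <type_traits>
#include <vector>

// Hypothetical options struct; only the field name `nan_is_null` and its
// default of false are taken from the proposal.
struct IsNullOptionsSketch {
  bool nan_is_null = false;  // false keeps the current behavior
};

// Stand-in for a nullable array: std::nullopt models a null slot.
template <typename T>
using NullableValues = std::vector<std::optional<T>>;

// Returns true at each position that is null or, when nan_is_null is set
// and T is a floating point type, holds NaN.
template <typename T>
std::vector<bool> IsNullSketch(const NullableValues<T>& values,
                               [[maybe_unused]] IsNullOptionsSketch options = {}) {
  std::vector<bool> out;
  out.reserve(values.size());
  for (const auto& v : values) {
    bool is_null = !v.has_value();
    // Only floating point types can hold NaN, so for every other type
    // the option is silently ignored.
    if constexpr (std::is_floating_point_v<T>) {
      if (!is_null && options.nan_is_null && std::isnan(*v)) {
        is_null = true;
      }
    }
    out.push_back(is_null);
  }
  return out;
}
```

Under these assumptions, the input {{(3.14, null, NaN)}} yields {{(false, true, false)}} with the default options and {{(false, true, true)}} with {{nan_is_null}} set to {{true}}, while an integer input gives the same result either way.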

Among other things, this would enable the {{arrow}} R package to evaluate 
{{is.na()}} consistently with the way base R does. In base R, {{is.na()}} 
returns {{TRUE}} on {{NaN}}. But in the {{arrow}} R package, it returns 
{{FALSE}}:
{code:r}
is.na(c(3.14, NA, NaN))
## [1] FALSE TRUE TRUE

as.vector(is.na(Array$create(c(3.14, NA, NaN))))
## [1] FALSE TRUE FALSE{code}
I think solving this with an option in the C++ kernel is the best solution, 
because I suspect there are other cases in which users might want to treat 
{{NaN}} as a missing value. However, it would also be possible to solve this 
just in the R package, by defining a mapping of {{is.na}} in the R package that 
checks if the input {{x}} has a floating point data type, and if so, evaluates 
{{is.na(x) | is.nan(x)}}. If we choose to go that route, we should change 
this Jira issue summary to "[R] Make is.na(NaN) consistent with base R".

  was:
(This is the flip side of ARROW-12960.)

Currently the Arrow compute kernel {{is_null}} always treats {{NaN}} as a 
non-missing value, returning {{false}} at positions of the input datum with 
value {{NaN}}.

It would be helpful to be able to control this behavior with an option. The 
option could be named {{nan_is_null}} or something similar.  It would default 
to {{false}}, consistent with current behavior. When set to {{true}}, it should 
check if the input datum has a floating point data type, and if so, return 
{{true}} at positions where the input is {{NaN}}. If the input datum has some 
other type, the option should be silently ignored.

Among other things, this would enable the {{arrow}} R package to evaluate 
{{is.na()}} consistently with the way base R does. In base R, {{is.na()}} 
returns {{TRUE}} on {{NaN}}. But in the {{arrow}} R package, it returns 
{{FALSE}}:
{code:r}
is.na(c(3.14, NA, NaN))
## [1] FALSE TRUE TRUE

as.vector(is.na(Array$create(c(3.14, NA, NaN))))
## [1] FALSE TRUE FALSE{code}
I think solving this with an option in the C++ kernel is the best solution, 
because I suspect there are other cases in which users might want to treat 
{{NaN}} as a missing value. However, it would also be possible to solve this 
just in the R package, by defining a mapping of {{is.na}} in the R package that 
checks if the input {{x}} has a floating point data type, and if so, evaluates 
{{is.na(x) | is.nan(x)}}. If we choose to go that route, we should change this 
Jira issue summary to "[R] Make is.na(NaN) consistent with base R".


> [C++][R] Option for is_null(NaN) to evaluate to true
> ----------------------------------------------------
>
>                 Key: ARROW-12959
>                 URL: https://issues.apache.org/jira/browse/ARROW-12959
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, R
>            Reporter: Ian Cook
>            Priority: Major
>
> (This is the flip side of ARROW-12960.)
> Currently the Arrow compute kernel {{is_null}} always treats {{NaN}} as a 
> non-missing value, returning {{false}} at positions of the input datum with 
> value {{NaN}}.
> It would be helpful to be able to control this behavior with an option. The 
> option could be named {{nan_is_null}} or something similar.  It would default 
> to {{false}}, consistent with current behavior. When set to {{true}}, it 
> should check if the input datum has a floating point data type, and if so, 
> return {{true}} at positions where the input is {{NaN}}. If the input datum 
> has some other type, the option should be silently ignored.
> Among other things, this would enable the {{arrow}} R package to evaluate 
> {{is.na()}} consistently with the way base R does. In base R, {{is.na()}} 
> returns {{TRUE}} on {{NaN}}. But in the {{arrow}} R package, it returns 
> {{FALSE}}:
> {code:r}
> is.na(c(3.14, NA, NaN))
> ## [1] FALSE TRUE TRUE
> as.vector(is.na(Array$create(c(3.14, NA, NaN))))
> ## [1] FALSE TRUE FALSE{code}
> I think solving this with an option in the C++ kernel is the best solution, 
> because I suspect there are other cases in which users might want to treat 
> {{NaN}} as a missing value. However, it would also be possible to solve this 
> just in the R package, by defining a mapping of {{is.na}} in the R package 
> that checks if the input {{x}} has a floating point data type, and if so, 
> evaluates {{is.na(x) | is.nan(x)}}. If we choose to go that route, we 
> should change this Jira issue summary to "[R] Make is.na(NaN) consistent with 
> base R".



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
