[ https://issues.apache.org/jira/browse/ARROW-9723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yibo Cai reassigned ARROW-9723:
-------------------------------

    Assignee: Yibo Cai

> [C++] Expected behaviour of "mode" kernel with NaNs ?
> -----------------------------------------------------
>
>                 Key: ARROW-9723
>                 URL: https://issues.apache.org/jira/browse/ARROW-9723
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Assignee: Yibo Cai
>            Priority: Major
>
> ARROW-9638 added a "mode" kernel to arrow::compute. There was some remaining 
> discussion on how NaNs should be handled.
> The merged PR made the kernel "skip" NaNs (just as it skips 
> nulls). For example:
> {code:python}
> [NaN, NaN, 1] -> mode:1, count:1
> [null, null, 1] -> mode:1, count:1
> [null, null, null] -> null
> [NaN, NaN, NaN] -> null  # should this be NaN instead?
> {code}
> However, {{scipy.stats}}, for example, does not skip NaNs and would return 
> {{mode:NaN, count:1}} for the last line above (NaNs do not compare equal to 
> each other, so each NaN is counted separately, giving a count of 1).
> Also, in other aggregations like {{sum}} we skip nulls but not NaNs (so 
> {{sum([NaN, NaN, 1])}} would be NaN).
> On the other hand, as [~apitrou] argued in the PR, for {{sum}} it's more 
> straightforward and informative to propagate the NaN to the result (at least 
> it indicates there are NaNs in the data), while for {{mode}} the count of 1 
> can also be surprising/misleading.
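
For illustration only, the two candidate semantics discussed above can be sketched in plain Python. The helper names (mode_skip_nan, mode_nan_distinct, sum_skip_null) are hypothetical and are not part of Arrow or SciPy. The tie-breaking rule (smallest value wins, and a non-NaN value is preferred over NaN) is an assumption made for this sketch, not something the issue specifies.

```python
import math
from collections import Counter


def _mode_of_non_nan(values):
    """Return (value, count) for the most common entry; ties go to the smallest value."""
    counts = Counter(values)
    best = max(counts.values())
    return min(v for v, c in counts.items() if c == best), best


def mode_skip_nan(values):
    """Merged-PR semantics as described in the issue: skip nulls (None) and NaNs."""
    kept = [v for v in values if v is not None and not math.isnan(v)]
    return _mode_of_non_nan(kept) if kept else None


def mode_nan_distinct(values):
    """Alternative semantics: nulls are skipped, each NaN is its own value with count 1.

    Assumption for this sketch: a non-NaN value wins a tie against NaN.
    """
    non_null = [v for v in values if v is not None]
    if not non_null:
        return None
    non_nan = [v for v in non_null if not math.isnan(v)]
    if non_nan:
        return _mode_of_non_nan(non_nan)
    return math.nan, 1  # only NaNs remain, each counted separately


def sum_skip_null(values):
    """Sum that skips nulls but lets NaN propagate into the result."""
    return sum(v for v in values if v is not None)


if __name__ == "__main__":
    nan = math.nan
    print(mode_skip_nan([nan, nan, 1.0]))       # (1.0, 1)
    print(mode_skip_nan([None, None, None]))    # None
    print(mode_skip_nan([nan, nan, nan]))       # None
    print(mode_nan_distinct([nan, nan, nan]))   # (nan, 1)
    print(sum_skip_null([nan, nan, 1.0]))       # nan
```

The NaNs are filtered out before counting because a Python dict (and hence Counter) can treat the same NaN object as one key, which would hide the "NaN is not equal to NaN" behaviour the issue describes.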



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
