[ https://issues.apache.org/jira/browse/ARROW-9723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Antoine Pitrou resolved ARROW-9723.
-----------------------------------
Fix Version/s: 2.0.0
Resolution: Fixed
Issue resolved by pull request 8061
[https://github.com/apache/arrow/pull/8061]
> [C++] Expected behaviour of "mode" kernel with NaNs ?
> -----------------------------------------------------
>
> Key: ARROW-9723
> URL: https://issues.apache.org/jira/browse/ARROW-9723
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Joris Van den Bossche
> Assignee: Yibo Cai
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.0.0
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> ARROW-9638 added a "mode" kernel to arrow::compute. There was some remaining
> discussion on how NaNs should be handled.
> The merged PR added the behaviour to "skip" NaNs (similarly to how it skips
> nulls). For example:
> {code:python}
> [NaN, NaN, 1] -> mode:1, count:1
> [null, null, 1] -> mode:1, count:1
> [null, null, null] -> null
> [NaN, NaN, NaN] -> null # should this be NaN instead?
> {code}
> However, {{scipy.stats}}, for example, does not skip NaNs and, for the last
> line above, would return {{mode:NaN, count:1}} (NaNs do not compare equal to
> each other, so each NaN is counted separately, giving a count of 1).
> Also, in other aggregations like {{sum}} we skip nulls but not NaNs (so
> {{sum([NaN, NaN, 1])}} would be NaN).
> On the other hand, as [~apitrou] argued in the PR, for {{sum}} it's more
> straightforward and informative to propagate the NaN to the result (at least
> it indicates there are NaNs in the data), while for {{mode}} a result of NaN
> with a count of 1 can be surprising or misleading.
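
To make the two candidate behaviours in the description concrete, here is a minimal pure-Python sketch. It is not the arrow::compute implementation and not how scipy.stats works internally. The function names (mode_skip_nan, mode_nan_distinct, sum_skip_null) are hypothetical, None stands in for null, and breaking ties by first occurrence is an arbitrary choice made for this illustration.

```python
import math


def _is_nan(v):
    # True only for float NaN; None (our stand-in for null) is handled separately.
    return isinstance(v, float) and math.isnan(v)


def mode_skip_nan(values):
    """Skip both nulls and NaNs. Returns (mode, count), or None if nothing is left."""
    counts = {}
    for v in values:
        if v is None or _is_nan(v):
            continue
        counts[v] = counts.get(v, 0) + 1
    if not counts:
        return None
    # max() keeps the first maximal key, so ties go to the first value seen.
    best = max(counts, key=counts.get)
    return best, counts[best]


def mode_nan_distinct(values):
    """Skip nulls only and group strictly with ==, so each NaN forms its own group."""
    groups = []  # list of [value, count]
    for v in values:
        if v is None:
            continue
        for g in groups:
            if g[0] == v:  # NaN == NaN is False, so NaNs never merge
                g[1] += 1
                break
        else:
            groups.append([v, 1])
    if not groups:
        return None
    best = max(groups, key=lambda g: g[1])
    return best[0], best[1]


def sum_skip_null(values):
    """Skip nulls but not NaNs, so any NaN propagates to the result."""
    return sum(v for v in values if v is not None)


if __name__ == "__main__":
    nan = float("nan")
    print(mode_skip_nan([nan, nan, 1]))        # (1, 1)
    print(mode_skip_nan([nan, nan, nan]))      # None
    print(mode_nan_distinct([nan, nan, nan]))  # (nan, 1)
    print(sum_skip_null([nan, nan, 1]))        # nan
```

Under the "skip" policy an all-NaN input gives the same result as an all-null input, whereas grouping by equality gives NaN with a count of 1, which is the result the description calls surprising.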