[ https://issues.apache.org/jira/browse/ARROW-9723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yibo Cai reassigned ARROW-9723:
-------------------------------

    Assignee: Yibo Cai

> [C++] Expected behaviour of "mode" kernel with NaNs?
> ----------------------------------------------------
>
>                 Key: ARROW-9723
>                 URL: https://issues.apache.org/jira/browse/ARROW-9723
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Assignee: Yibo Cai
>            Priority: Major
>
> ARROW-9638 added a "mode" kernel to arrow::compute. There was some remaining
> discussion on how NaNs should be handled.
> The merged PR made the kernel "skip" NaNs (just as it skips nulls). For
> example:
> {code:python}
> [NaN, NaN, 1] -> mode:1, count:1
> [null, null, 1] -> mode:1, count:1
> [null, null, null] -> null
> [NaN, NaN, NaN] -> null # should this be NaN instead?
> {code}
> But {{scipy.stats}}, for example, does not skip NaNs and would return
> {{mode:NaN, count:1}} for the last line above (the NaNs are not equal to
> each other, so each NaN is counted separately, giving a count of 1).
> Also, in other aggregations like {{sum}} we skip nulls but not NaNs (so
> {{sum([NaN, NaN, 1])}} would be NaN).
> On the other hand, as [~apitrou] argued in the PR, for {{sum}} it is more
> straightforward and informative to propagate the NaN to the result (at least
> it indicates that there are NaNs in the data), while for {{mode}} the count
> of 1 can also be surprising or misleading.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
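
The two NaN policies contrasted in the issue can be sketched in plain Python. This is only an illustration under stated assumptions, not the arrow::compute kernel or the scipy.stats implementation: the function names are hypothetical, nulls are modelled as None, and the tie-breaking rules (smallest value wins, a non-NaN value wins over a NaN) are choices made for this sketch.

```python
import math
from collections import Counter


def _mode_of_plain_values(values):
    """Return (mode, count) for a non-empty list that contains no None or NaN."""
    counts = Counter(values)
    best = max(counts.values())
    # Tie-breaking is an assumption of this sketch: pick the smallest value.
    mode = min(v for v, c in counts.items() if c == best)
    return mode, best


def mode_skipping_nan(values):
    """Policy 1: skip NaNs the same way nulls (None) are skipped."""
    kept = [v for v in values if v is not None and not math.isnan(v)]
    if not kept:
        return None  # nothing left to count, so the result is null
    return _mode_of_plain_values(kept)


def mode_counting_nan_separately(values):
    """Policy 2: skip nulls, but treat every NaN as its own distinct value."""
    kept = [v for v in values if v is not None]
    if not kept:
        return None
    non_nan = [v for v in kept if not math.isnan(v)]
    if non_nan:
        # Each NaN has a count of 1, so any non-NaN value ties or beats it.
        # Preferring the non-NaN value on a tie is an assumption of this sketch.
        return _mode_of_plain_values(non_nan)
    # Only NaNs remain: each one was counted once.
    return float("nan"), 1


nan = float("nan")
print(mode_skipping_nan([nan, nan, 1.0]))
print(mode_skipping_nan([nan, nan, nan]))
print(mode_counting_nan_separately([nan, nan, nan]))
```

The two policies differ only on the all-NaN input: the first returns null, while the second returns a NaN mode with a count of 1, which is the result the issue describes as potentially surprising.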