[ https://issues.apache.org/jira/browse/ARROW-13764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405299#comment-17405299 ]
Neal Richardson commented on ARROW-13764:
-----------------------------------------
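[~lidavidm] Neither of those, if I understand correctly. Here's the behaviour
dplyr produces: by default NA counts as a distinct value, with na.rm = TRUE it
is dropped, and an NA key forms its own group either way: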
{code}
> data.frame(keys = c(0, 0, 1, 1, NA), values = c("a", NA, "b", "c", "d")) %>%
+   group_by(keys) %>% summarize(n_distinct(values))
# A tibble: 3 × 2
   keys `n_distinct(values)`
  <dbl>                <int>
1     0                    2
2     1                    2
3    NA                    1
> data.frame(keys = c(0, 0, 1, 1, NA), values = c("a", NA, "b", "c", "d")) %>%
+   group_by(keys) %>% summarize(n_distinct(values, na.rm = TRUE))
# A tibble: 3 × 2
   keys `n_distinct(values, na.rm = TRUE)`
  <dbl>                              <int>
1     0                                  1
2     1                                  2
3    NA                                  1
{code}
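To make the two expectations above concrete, here is a minimal sketch of the same semantics in plain Python. It is only an illustration, not the Arrow kernel or the dplyr implementation: the function name {{grouped_n_distinct}} and its {{na_rm}} parameter are hypothetical, and {{None}} stands in for NA/NULL.

```python
from collections import defaultdict


def grouped_n_distinct(keys, values, na_rm=False):
    """Count distinct values per key (hypothetical helper; None models NA/NULL)."""
    seen = defaultdict(set)
    for key, value in zip(keys, values):
        # Touch the group first so it still appears even if all its values are dropped.
        group = seen[key]
        if na_rm and value is None:
            continue  # na_rm=True: nulls are not counted as a distinct value
        group.add(value)
    return {key: len(vals) for key, vals in seen.items()}


keys = [0, 0, 1, 1, None]
values = ["a", None, "b", "c", "d"]

default_counts = grouped_n_distinct(keys, values)
na_rm_counts = grouped_n_distinct(keys, values, na_rm=True)

print(default_counts)  # {0: 2, 1: 2, None: 1}
print(na_rm_counts)    # {0: 1, 1: 2, None: 1}
```

In this sketch the null key always forms its own group; only nulls in the value column are affected by {{na_rm}}, which matches the two tibbles above.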
> [C++] Implement ScalarAggregateOptions for count_distinct (grouped)
> --------------------------------------------------------------------
>
> Key: ARROW-13764
> URL: https://issues.apache.org/jira/browse/ARROW-13764
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Nic Crane
> Assignee: David Li
> Priority: Major
> Labels: kernel
> Fix For: 6.0.0
>
>
> I'm writing the R bindings for the grouped {{count_distinct}} kernel, but the
> current implementation counts nulls as their own group. To match the R
> behaviour, I need to be able to specify whether or not to remove NA/NULL
> values.
> Please could we have ScalarAggregateOptions implemented for
> {{count_distinct}}?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)