[ 
https://issues.apache.org/jira/browse/ARROW-13764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405299#comment-17405299
 ] 

Neal Richardson commented on ARROW-13764:
-----------------------------------------

[~lidavidm] Neither of those, if I understand it. Here's the expectation from 
dplyr: 

{code}
> library(dplyr)
> data.frame(keys = c(0, 0, 1, 1, NA), values = c("a", NA, "b", "c", "d")) %>% 
+   group_by(keys) %>% summarize(n_distinct(values))
# A tibble: 3 × 2
   keys `n_distinct(values)`
  <dbl>                <int>
1     0                    2
2     1                    2
3    NA                    1
> data.frame(keys = c(0, 0, 1, 1, NA), values = c("a", NA, "b", "c", "d")) %>% 
+   group_by(keys) %>% summarize(n_distinct(values, na.rm = TRUE))
# A tibble: 3 × 2
   keys `n_distinct(values, na.rm = TRUE)`
  <dbl>                              <int>
1     0                                  1
2     1                                  2
3    NA                                  1
{code}
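
To spell out the semantics shown in the dplyr output, here is a minimal sketch in plain Python. It is not Arrow's kernel or its API: the function name `grouped_count_distinct` and the `skip_nulls` flag are hypothetical, and `None` stands in for NA/NULL. The point is only that a null key still forms its own group, while null values are counted or dropped depending on the flag.

```python
from collections import defaultdict


def grouped_count_distinct(keys, values, skip_nulls=False):
    """Count distinct values per key; None stands in for NA/NULL.

    Hypothetical helper for illustration only, not part of Arrow or dplyr.
    """
    seen = defaultdict(set)
    for key, value in zip(keys, values):
        # Touch the group first so that a group holding only nulls
        # still appears in the result (with a count of 0 when skipping).
        group = seen[key]  # a None key forms its own group
        if skip_nulls and value is None:
            continue
        group.add(value)
    return {key: len(vals) for key, vals in seen.items()}


keys = [0, 0, 1, 1, None]
values = ["a", None, "b", "c", "d"]

# Like n_distinct(values): the null in group 0 counts as a distinct value.
print(grouped_count_distinct(keys, values))
# Like n_distinct(values, na.rm = TRUE): the null in group 0 is dropped.
print(grouped_count_distinct(keys, values, skip_nulls=True))
```

With the same data as above, group 0 goes from 2 to 1 when nulls are skipped, and the other two groups are unchanged.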

> [C++] Implement ScalarAggregateOptions for count_distinct (grouped) 
> --------------------------------------------------------------------
>
>                 Key: ARROW-13764
>                 URL: https://issues.apache.org/jira/browse/ARROW-13764
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Nic Crane
>            Assignee: David Li
>            Priority: Major
>              Labels: kernel
>             Fix For: 6.0.0
>
>
> I'm writing the R bindings for the grouped {{count_distinct}} kernel, but the 
> current implementation counts nulls as their own group.  To match the R 
> behaviour, I need to be able to specify whether or not to remove NA/NULL 
> values.
> Please could we have ScalarAggregateOptions implemented for 
> {{count_distinct}}?



