ianmcook commented on a change in pull request #11257:
URL: https://github.com/apache/arrow/pull/11257#discussion_r720873549
##########
File path: r/tests/testthat/test-dplyr-summarize.R
##########
@@ -227,6 +228,19 @@ test_that("Group by n_distinct() on dataset", {
collect(),
tbl
)
+ # Without groupby
+ expect_dplyr_equal(
+ input %>%
+ summarize(distinct = n_distinct(lgl, na.rm = FALSE)) %>%
+ collect(),
+ tbl
+ )
+ expect_dplyr_equal(
+ input %>%
+ summarize(distinct = n_distinct(lgl, na.rm = TRUE)) %>%
Review comment:
I think we will need to save support for multiple arguments to `n_distinct()`
for a follow-up.
As @aucahuasi says, the `count_distinct` kernel is unary. In theory, we could
make it count the distinct combinations of values in two or more columns by
packing all the columns into a struct and passing the struct to
`count_distinct`. So, for example, we would replace
```r
summarise(n_distinct(x, y))
```
with
```r
summarise(n_distinct(
  arrow_make_struct(x, y, options = list(field_names = c("x", "y")))
))
```
But there are two challenges with that approach:
1. It would require more hacky
recurse-through-the-AST-and-replace-stuff, like in #11018.
2. I don't think the `count_distinct` kernel currently accepts nested types.
I opened ARROW-14209 for the follow-up.
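For reference, here is a minimal base R sketch of the equivalence the struct idea relies on: counting distinct combinations across several columns gives the same answer as counting distinct values of a single composite column. It does not use Arrow at all. The data frame, its values, and the `packed` list are made up for illustration, and a list of per-row records only stands in for an Arrow struct column.

```r
# Toy data; the column names x and y mirror the example above.
df <- data.frame(
  x = c(1L, 1L, 2L, 2L, 1L),
  y = c("a", "a", "a", "b", "b"),
  stringsAsFactors = FALSE
)

# Multi-column semantics: the number of distinct (x, y) rows.
n_combos <- nrow(unique(df[, c("x", "y")]))

# Unary equivalent: pack each row into one record (a stand-in for a struct
# value), then count the distinct values of that single list.
packed <- Map(function(x, y) list(x = x, y = y), df$x, df$y)
n_packed <- length(unique(packed))

# Both approaches should agree on the number of distinct combinations.
stopifnot(n_packed == n_combos)
```

The point of the sketch is only that a unary distinct count over a composite value reproduces the multi-column result. Whether `count_distinct` can do the same with a real struct input depends on challenge 2 above.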