[GitHub] [arrow] ianmcook commented on a change in pull request #11257: ARROW-14035: [C++][Python][R] Implement count distinct kernel

GitBox Sun, 03 Oct 2021 12:12:44 -0700


ianmcook commented on a change in pull request #11257:
URL: https://github.com/apache/arrow/pull/11257#discussion_r720873549




##########
File path: r/tests/testthat/test-dplyr-summarize.R
##########
@@ -227,6 +228,19 @@ test_that("Group by n_distinct() on dataset", {
       collect(),
     tbl
   )
+  # Without groupby
+  expect_dplyr_equal(
+    input %>%
+      summarize(distinct = n_distinct(lgl, na.rm = FALSE)) %>%
+      collect(),
+    tbl
+  )
+  expect_dplyr_equal(
+    input %>%
+      summarize(distinct = n_distinct(lgl, na.rm = TRUE)) %>%

Review comment:
   I think we will need to save multi-column `n_distinct()` support for a follow-up.
   
   As @aucahuasi says, the `count_distinct` kernel is unary. In theory we could
   make it count the distinct combinations of values in two or more columns by
   packing all the columns into a struct and passing the struct to
   `count_distinct`. So, for example, we would replace
   ```r
   summarise(n_distinct(x, y))
   ```
   with
   ```r
   summarise(n_distinct(arrow_make_struct(x, y, options = list(field_names = c("x", "y")))))
   ```
   
   But there are two challenges there:
   
   1. This would require doing more hacky recurse-through-the-AST-and-replace-stuff, like in #11018.
   2. I don't think this `count_distinct` kernel currently accepts nested types.
   
   I opened ARROW-14209 for the follow-up.
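
To make the struct idea above concrete, here is a minimal plain-Python sketch of the concept. It is not Arrow code: `count_distinct` and `make_struct` below are hypothetical stand-ins written for illustration, and the `skip_nulls` flag is only an assumed analogue of `na.rm`. The point is that a unary distinct count, applied to a column of packed rows, counts distinct combinations.

```python
def count_distinct(values, skip_nulls=False):
    """Hypothetical unary kernel: count distinct values in ONE column."""
    if skip_nulls:
        values = [v for v in values if v is not None]
    return len(set(values))


def make_struct(*columns):
    """Hypothetical stand-in: pack parallel columns into one column of row tuples."""
    lengths = {len(col) for col in columns}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    return list(zip(*columns))


x = [1, 1, 2, 2, None]
y = ["a", "a", "a", "b", "b"]

# The kernel stays unary: it sees a single column whose values are (x, y) rows.
n_combinations = count_distinct(make_struct(x, y))
print(n_combinations)  # 4
```

This only works here because Python tuples are hashable, which is the analogue of the second challenge: the real kernel would have to accept nested (struct) input before the rewrite could be used.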




