jonkeane commented on code in PR #44652:
URL: https://github.com/apache/arrow/pull/44652#discussion_r1835693557


##########
r/tests/testthat/test-dplyr-distinct.R:
##########
@@ -115,12 +103,57 @@ test_that("across() works in distinct()", {
 })
 
 test_that("distinct() can return all columns", {
-  skip("ARROW-14045")
-  compare_dplyr_binding(
-    .input %>%
-      distinct(lgl, .keep_all = TRUE) %>%
-      collect() %>%
-      arrange(int),
-    tbl
-  )
+  # hash_one prefers to keep non-null values, which is different from .keep_all in dplyr
+  # so we can't compare the result directly
+  expected <- tbl %>%
+    # Drop factor because of #44661:
+    # NotImplemented: Function 'hash_one' has no kernel matching input types
+    #   (dictionary<values=string, indices=int8, ordered=0>, uint8)

Review Comment:
   Is the message on lines 110-111 the error that someone would get if they tried
   `distinct(..., .keep_all = TRUE)` with a factor in the table/data.frame?
   
   We might want to make that a bit nicer / more grokable for folks who might not
   have the dictionary -> factor knowledge top of mind.
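
A rough sketch of what a friendlier failure could look like, in base R. `check_no_factors()` is a hypothetical helper, not an arrow function, and it inspects a plain data.frame rather than an Arrow query, so treat it as an illustration of the message only:

```r
# Hypothetical helper (not part of arrow): fail early with a readable message
# when `.keep_all = TRUE` would run into factor (dictionary) columns.
check_no_factors <- function(df) {
  is_fct <- vapply(df, is.factor, logical(1))
  if (any(is_fct)) {
    stop(
      "distinct(.keep_all = TRUE) does not currently support factor columns: ",
      paste(names(df)[is_fct], collapse = ", "),
      ". Consider converting them with as.character() first.",
      call. = FALSE
    )
  }
  invisible(df)
}

# Example: column `f` is a factor, so the check raises the friendlier error.
df <- data.frame(x = 1:2, f = factor(c("a", "b")))
msg <- tryCatch(check_no_factors(df), error = function(e) conditionMessage(e))
print(msg)
```

The point is that the message names the offending columns and says "factor", so users do not need to know that factors map to Arrow dictionary types.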



##########
r/R/dplyr-distinct.R:
##########
@@ -33,11 +27,28 @@ distinct.arrow_dplyr_query <- function(.data, ..., .keep_all = FALSE) {
     .data <- dplyr::group_by(.data, !!!syms(names(.data)))
   }
 
-  out <- dplyr::summarize(.data, .groups = "drop")
+  if (isTRUE(.keep_all)) {
+    # Note: in regular dplyr, `.keep_all = TRUE` returns the first row's value.
+    # However, Acero's `hash_one` function prefers returning non-null values.
+    # So, you'll get the same shape of data, but the values may differ.

Review Comment:
   This behavior change is probably either not impactful or, if folks are relying
   on getting the first row's value, actually a bug in their code. It does seem
   like something we should mention, though (in the docs at least?).
   
   Or maybe with a one-time warning? 
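
For the one-time warning idea, a minimal base R sketch follows. The names `.warned` and `warn_keep_all_once()` are hypothetical and the message wording is only a placeholder; rlang's `warn()` with `.frequency = "once"` would be another way to get the same effect:

```r
# Hypothetical sketch: emit the `.keep_all` caveat only once per session.
.warned <- new.env(parent = emptyenv())

warn_keep_all_once <- function() {
  if (!isTRUE(.warned$keep_all)) {
    .warned$keep_all <- TRUE
    warning(
      "distinct(.keep_all = TRUE) on Arrow data may keep a non-null value ",
      "rather than the first row's value, unlike dplyr.",
      call. = FALSE
    )
  }
  invisible(NULL)
}

# The first call warns; later calls stay silent.
n_warnings <- 0
for (i in 1:3) {
  withCallingHandlers(
    warn_keep_all_once(),
    warning = function(w) {
      n_warnings <<- n_warnings + 1
      invokeRestart("muffleWarning")
    }
  )
}
print(n_warnings)
```

Keeping the flag in a dedicated environment avoids touching global options, and the flag resets naturally with each new R session.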



