jonkeane commented on code in PR #44652:
URL: https://github.com/apache/arrow/pull/44652#discussion_r1835693557

##########
r/tests/testthat/test-dplyr-distinct.R:
##########
@@ -115,12 +103,57 @@ test_that("across() works in distinct()", {
 })
 
 test_that("distinct() can return all columns", {
-  skip("ARROW-14045")
-  compare_dplyr_binding(
-    .input %>%
-      distinct(lgl, .keep_all = TRUE) %>%
-      collect() %>%
-      arrange(int),
-    tbl
-  )
+  # hash_one prefers to keep non-null values, which is different from .keep_all in dplyr
+  # so we can't compare the result directly
+  expected <- tbl %>%
+    # Drop factor because of #44661:
+    # NotImplemented: Function 'hash_one' has no kernel matching input types
+    #   (dictionary<values=string, indices=int8, ordered=0>, uint8)

Review Comment:
   Is the message on lines 110-111 the error that someone would get if they tried `distinct(..., .keep_all = TRUE)` with a factor in the table/data.frame? We might want to make that a bit nicer and more grokable for folks who might not have the dictionary -> factor mapping top of mind.


##########
r/R/dplyr-distinct.R:
##########
@@ -33,11 +27,28 @@ distinct.arrow_dplyr_query <- function(.data, ..., .keep_all = FALSE) {
     .data <- dplyr::group_by(.data, !!!syms(names(.data)))
   }
 
-  out <- dplyr::summarize(.data, .groups = "drop")
+  if (isTRUE(.keep_all)) {
+    # Note: in regular dplyr, `.keep_all = TRUE` returns the first row's value.
+    # However, Acero's `hash_one` function prefers returning non-null values.
+    # So, you'll get the same shape of data, but the values may differ.

Review Comment:
   This behavior change is probably not impactful; if folks are relying on the first-row behavior, that is arguably a bug in their code. It does seem like something we should mention, though (in the docs at least?), or maybe flag with a one-time warning?
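
   For reference, a minimal base-R sketch of the difference discussed above. It only mimics the two behaviors as the code comment describes them (first row per group versus preferring a non-null value); it does not call arrow or Acero, and `first_non_null` is a hypothetical helper written for this illustration, not an existing function.

```r
# Toy data: group "a" has a missing value in its first row.
df <- data.frame(
  g = c("a", "a", "b"),
  x = c(NA, 1, 2)
)

# dplyr-style `.keep_all = TRUE`: keep the first row of each group as-is.
first_row <- df[!duplicated(df$g), ]

# hash_one-style, as described in the code comment: per group, prefer a
# non-null value and fall back to the first value if everything is NA.
# `first_non_null` is a hypothetical helper for this sketch only.
first_non_null <- function(v) {
  non_null <- v[!is.na(v)]
  if (length(non_null) > 0) non_null[[1]] else v[[1]]
}
picked <- vapply(split(df$x, df$g), first_non_null, numeric(1))
prefer_non_null <- data.frame(g = names(picked), x = unname(picked))

# Same shape, different values for group "a":
#   first_row$x       is NA, 2
#   prefer_non_null$x is 1, 2
```

   Both results have one row per group, so code that only depends on the shape is unaffected; code that depends on which row's value comes back for group "a" would see a difference.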