Re: [PR] GH-29642: [R] Support for .keep_all = TRUE with distinct() [arrow]

via GitHub Fri, 15 Nov 2024 05:28:16 -0800


nealrichardson commented on code in PR #44652:
URL: https://github.com/apache/arrow/pull/44652#discussion_r1843769552



##########
r/R/dplyr-distinct.R:
##########
@@ -33,11 +27,28 @@ distinct.arrow_dplyr_query <- function(.data, ..., .keep_all = FALSE) {
     .data <- dplyr::group_by(.data, !!!syms(names(.data)))
   }
 
-  out <- dplyr::summarize(.data, .groups = "drop")
+  if (isTRUE(.keep_all)) {
+    # Note: in regular dplyr, `.keep_all = TRUE` returns the first row's value.
+    # However, Acero's `hash_one` function prefers returning non-null values.
+    # So, you'll get the same shape of data, but the values may differ.

Review Comment:
   It is documented on the Acero man page; that's the change to 
arrow-package.R. I'd rather not add a one-time warning: that's a slippery 
slope toward being chatty about every subtle difference between how Acero 
works and how dplyr works on data.frames.
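
   For readers following along, here is a minimal sketch of the difference the code comment in the diff describes. The data frame `df` and its columns `g` and `x` are made up for illustration. The dplyr part shows the first-row behaviour on a data.frame; the arrow part only runs if arrow is installed, and it assumes a build that includes this PR, so no output is claimed for it.

```r
library(dplyr)

# Hypothetical data: the first row for g == "a" has a missing x.
df <- data.frame(
  g = c("a", "a", "b"),
  x = c(NA, 1, 2)
)

# dplyr on a data.frame keeps the first row for each distinct value of g,
# so the NA in the first "a" row is what comes back.
res <- df %>% distinct(g, .keep_all = TRUE)
print(res)

# Arrow query (assumption: an arrow build with .keep_all support from this PR).
# Per the code comment in the diff, Acero's hash_one prefers non-null values,
# so x for g == "a" may differ from the dplyr result above.
if (requireNamespace("arrow", quietly = TRUE)) {
  arrow_res <- arrow::arrow_table(df) %>%
    distinct(g, .keep_all = TRUE) %>%
    collect()
  print(arrow_res)
}
```

   Both results should have one row per value of `g`; only the value picked for `x` is expected to vary.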



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
