[GitHub] [arrow] nealrichardson commented on a diff in pull request #13934: ARROW-14045 [R] Support for .keep_all = TRUE with distinct()

GitBox Mon, 19 Sep 2022 09:49:57 -0700


nealrichardson commented on code in PR #13934:
URL: https://github.com/apache/arrow/pull/13934#discussion_r974470488



##########
r/R/dplyr-distinct.R:
##########
@@ -35,8 +29,24 @@ distinct.arrow_dplyr_query <- function(.data, ..., .keep_all = FALSE) {
     # distinct() with no vars specified means distinct across all cols
     .data <- dplyr::group_by(.data, !!!syms(names(.data)))
   }
-
-  out <- dplyr::summarize(.data, .groups = "drop")
+  if (isTRUE(.keep_all)) {
+    # (TODO) `.keep_all = TRUE` can return first row value, but this implementation
+    # do not always return it because `hash_one` skips rows if they contain null value.
+    # If group vars do not uniquely determine return values of each cols,
+    # the result will become different from the original.
+    # If NOT, this option may distroy data.
+    warning(".keep_all = TRUE currently not guarantee to take first row value in each cols.")
+    keeps <- names(.data)[!(names(.data) %in% .data$group_by_vars)]
+    # `one()` is wrapper for calling "hash_one" function (implemented ARROW-13993)
+    # `USAGE: summarize(x = one(x), y = one(y) ...)` for x, y in non-group cols
+    exprs <- lapply(keeps, function(x) call2("one", sym(x)))
+    names(exprs) <- keeps
+    out <- dplyr::summarize(.data, !!!exprs, .groups = "drop")
+    # restore cols order

Review Comment:
   Why would the column order be wrong? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

