nealrichardson commented on code in PR #13934:
URL: https://github.com/apache/arrow/pull/13934#discussion_r974470488
##########
r/R/dplyr-distinct.R:
##########
@@ -35,8 +29,24 @@ distinct.arrow_dplyr_query <- function(.data, ..., .keep_all = FALSE) {
# distinct() with no vars specified means distinct across all cols
.data <- dplyr::group_by(.data, !!!syms(names(.data)))
}
-
- out <- dplyr::summarize(.data, .groups = "drop")
+ if (isTRUE(.keep_all)) {
+ # (TODO) `.keep_all = TRUE` should return the first row's values, but this implementation
+ # does not always do so, because `hash_one` skips rows that contain a null value.
+ # If the group vars do not uniquely determine the values of each column,
+ # the result will differ from the original.
+ # If NOT, this option may destroy data.
+ warning(".keep_all = TRUE currently does not guarantee taking the first row value in each column.")
+ keeps <- names(.data)[!(names(.data) %in% .data$group_by_vars)]
+ # `one()` is a wrapper for calling the "hash_one" function (implemented in ARROW-13993)
+ # `USAGE: summarize(x = one(x), y = one(y) ...)` for x, y in non-group cols
+ exprs <- lapply(keeps, function(x) call2("one", sym(x)))
+ names(exprs) <- keeps
+ out <- dplyr::summarize(.data, !!!exprs, .groups = "drop")
+ # restore cols order
Review Comment:
Why would the column order be wrong?
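For context on the question above, here is a minimal sketch of how the order could change. It assumes the Arrow query behaves like in-memory dplyr, where `summarize()` places the grouping columns first. The data frame `df` and its column names are hypothetical, and `dplyr::first()` is only a stand-in for `one()`; this is not the PR's code.

```r
library(dplyr)

# Hypothetical data: the grouping column `g` is not the first column.
df <- data.frame(x = c(1, 2, 3), g = c("a", "a", "b"), y = c(10, 20, 30))

# Stand-in for `one()`: dplyr::first() picks one value per group.
out <- df %>%
  group_by(g) %>%
  summarize(x = first(x), y = first(y), .groups = "drop")

# summarize() puts the grouping column first.
names(out)       # "g" "x" "y"

# Selecting by the original names restores the input column order.
restored <- out[, names(df)]
names(restored)  # "x" "g" "y"
```

If the Arrow implementation of `summarize()` orders its output the same way, a reordering step is only needed when a group variable is not already a leading column.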