mopcup commented on code in PR #13934:
URL: https://github.com/apache/arrow/pull/13934#discussion_r978341369
##########
r/R/dplyr-distinct.R:
##########
@@ -35,8 +29,24 @@ distinct.arrow_dplyr_query <- function(.data, ..., .keep_all = FALSE) {
# distinct() with no vars specified means distinct across all cols
.data <- dplyr::group_by(.data, !!!syms(names(.data)))
}
-
- out <- dplyr::summarize(.data, .groups = "drop")
+ if (isTRUE(.keep_all)) {
+ # (TODO) `.keep_all = TRUE` can return first row value, but this implementation
+ # do not always return it because `hash_one` skips rows if they contain null value.
+ # If group vars do not uniquely determine return values of each cols,
+ # the result will become different from the original.
+ # If NOT, this option may distroy data.
+ warning(".keep_all = TRUE currently not guarantee to take first row value in each cols.")
Review Comment:
> Instead of raising a warning, can you rebase this PR

Done, I have rebased.

> and in https://github.com/apache/arrow/blob/master/r/R/arrow-package.R#L53, leave the note that ".keep_all = TRUE will keep one non-missing value for each column but it may not be the 'first row'"

Just to confirm: is the following change what you had in mind?
```
diff --git a/r/R/arrow-package.R b/r/R/arrow-package.R
index e6b3f481e..9c003969d 100644
--- a/r/R/arrow-package.R
+++ b/r/R/arrow-package.R
@@ -50,7 +50,7 @@ supported_dplyr_methods <- list(
relocate = NULL,
compute = NULL,
collapse = NULL,
- distinct = NULL,
+ distinct = ".keep_all = TRUE will keep one non-missing value for each column but it may not be the 'first row'",
left_join = NULL,
right_join = NULL,
inner_join = NULL,
```
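For context, here is a minimal base R sketch of the difference the note describes. It does not call Arrow or `hash_one`; `first_non_missing()` is a hypothetical helper, and picking the *first* non-missing value per column is an assumption made only for illustration.

```r
# Toy data: group "a" has a missing value in the first row of column x.
df <- data.frame(
  g = c("a", "a", "b"),
  x = c(NA, 1, 2),
  y = c("p", "q", "r"),
  stringsAsFactors = FALSE
)

# "First row" semantics: keep the first row seen for each value of g.
first_row <- df[!duplicated(df$g), ]

# "One non-missing value per column" semantics, emulated in base R.
# Hypothetical helper: returns the first non-missing element of v,
# or a missing value of the same type if every element is missing.
first_non_missing <- function(v) {
  non_missing <- v[!is.na(v)]
  if (length(non_missing) == 0) v[1] else non_missing[1]
}

# Apply the helper to every column within each group, then stack the groups.
groups <- split(df, df$g)
per_column <- do.call(rbind, lapply(groups, function(d) {
  as.data.frame(lapply(d, first_non_missing), stringsAsFactors = FALSE)
}))

print(first_row)
print(per_column)
```

For group `a`, the first-row result keeps `x = NA` with `y = "p"`, while the per-column result pairs `x = 1` (taken from the second row) with `y = "p"` (taken from the first row), so the returned row may be a combination that never existed in the input.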