[GitHub] [arrow] mopcup commented on a diff in pull request #13934: ARROW-14045 [R] Support for .keep_all = TRUE with distinct()

GitBox Fri, 23 Sep 2022 00:34:10 -0700


mopcup commented on code in PR #13934:
URL: https://github.com/apache/arrow/pull/13934#discussion_r978341369



##########
r/R/dplyr-distinct.R:
##########
@@ -35,8 +29,24 @@ distinct.arrow_dplyr_query <- function(.data, ..., .keep_all = FALSE) {
     # distinct() with no vars specified means distinct across all cols
     .data <- dplyr::group_by(.data, !!!syms(names(.data)))
   }
-
-  out <- dplyr::summarize(.data, .groups = "drop")
+  if (isTRUE(.keep_all)) {
+    # (TODO) `.keep_all = TRUE` should return the first row's values, but this implementation
+    # does not always return them because `hash_one` skips rows if they contain a null value.
+    # If the group vars do not uniquely determine the values of each column,
+    # the result will differ from the original.
+    # If NOT, this option may destroy data.
+    warning(".keep_all = TRUE currently does not guarantee taking the first row's value in each column.")

Review Comment:
   > Instead of raising a warning, can you rebase this PR
   
   I did.
   > and in https://github.com/apache/arrow/blob/master/r/R/arrow-package.R#L53, leave the note that ".keep_all = TRUE will keep one non-missing value for each column but it may not be the 'first row'"
   
   Please let me confirm. Is this fix correct?
   ```
   diff --git a/r/R/arrow-package.R b/r/R/arrow-package.R
   index e6b3f481e..9c003969d 100644
   --- a/r/R/arrow-package.R
   +++ b/r/R/arrow-package.R
   @@ -50,7 +50,7 @@ supported_dplyr_methods <- list(
      relocate = NULL,
      compute = NULL,
      collapse = NULL,
   -  distinct = NULL,
   +  distinct = ".keep_all = TRUE will keep one non-missing value for each column but it may not be the 'first row'",
      left_join = NULL,
      right_join = NULL,
      inner_join = NULL,
   ```
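
For reference, the difference that note describes can be sketched in plain dplyr. This is only an illustration, not the Arrow implementation: the data frame `df` and the helper `first_non_missing` are hypothetical, and the column-wise pick merely emulates the "one non-missing value for each column" behavior attributed to `hash_one` above.

```r
library(dplyr)

# Hypothetical data: group "a" has missing values in different rows.
df <- tibble(
  g = c("a", "a", "b"),
  x = c(NA, 1, 2),
  y = c("p", NA, "q")
)

# dplyr semantics: keep the first row of each group, missing values included.
first_row <- distinct(df, g, .keep_all = TRUE)

# Assumed emulation of "one non-missing value per column":
# pick the first non-missing value of each column independently.
first_non_missing <- function(v) {
  non_missing <- v[!is.na(v)]
  if (length(non_missing) == 0) v[1] else non_missing[1]
}

per_column <- df %>%
  group_by(g) %>%
  summarise(across(everything(), first_non_missing), .groups = "drop")

print(first_row)
print(per_column)
```

In this sketch the two results disagree for group `a`: `first_row` keeps the missing `x` from the first row, while `per_column` combines `x` and `y` from different rows, producing a row that never existed in the input.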
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
