nealrichardson commented on a change in pull request #11018:
URL: https://github.com/apache/arrow/pull/11018#discussion_r713294547



##########
File path: r/R/dplyr-summarize.R
##########
@@ -251,3 +268,21 @@ extract_aggregations <- function(expr, ctx) {
   }
   expr
 }
+
+# This function recurses through expr and wraps each call to median() with a
+# call to arrow_list_element()
+wrap_median <- function(expr, hash) {
+  if (length(expr) == 1) {
+    return(expr)
+  } else {
+    if (is.call(expr) && expr[[1]] == quote(median)) {
+      if (hash) {
+        return(str2lang(paste0("arrow_list_element(", deparse1(expr), ", 
0L)")))
+      } else {
+        return(expr)

Review comment:
       I don't understand, you mean that hash_tdigest returns a list type but 
`tdigest` does not? That seems wrong, doesn't it?

##########
File path: r/R/dplyr-summarize.R
##########
@@ -161,6 +166,18 @@ summarize_eval <- function(name, quosure, ctx, recurse = 
FALSE) {
     return()
   }
 
+  if ("median" %in% funs_in_expr) {

Review comment:
       Make a comment up here explaining that for median/quantile, we have to 
translate the function into something that does a `mutate()` after aggregating, 
and we need to do that before determining outer_agg/inner_agg below.

##########
File path: r/R/dplyr-functions.R
##########
@@ -893,6 +901,12 @@ output_type <- function(fun, input_type) {
     input_type
   } else if (fun %in% c("mean", "stddev", "variance")) {
     float64()
+  } else if (fun %in% "tdigest") {
+    if (hash) {
+      fixed_size_list_of(float64(), 1L)
+    } else {
+      float64()

Review comment:
       I'm not sure what tie-breaking algorithm is being used, but wouldn't 
median/quantile be expected to return a value of the same type as the data?
   
   ```suggestion
         fixed_size_list_of(input_type, 1L)
       } else {
         input_type
   ```

##########
File path: r/tests/testthat/test-dplyr-summarize.R
##########
@@ -229,6 +229,60 @@ test_that("Group by n_distinct() on dataset", {
   )
 })
 
+test_that("median()", {
+  # with groups
+  expect_dplyr_equal(
+    input %>%
+      group_by(some_grouping) %>%
+      summarize(
+        med_dbl = median(dbl),
+        med_int = median(int),
+        med_dbl_narmf = median(dbl, FALSE),
+        # styler: off
+        med_int_narmf = median(int, na.rm = F),

Review comment:
       Why are you using F and T here?

##########
File path: r/R/dplyr-summarize.R
##########
@@ -161,6 +166,18 @@ summarize_eval <- function(name, quosure, ctx, recurse = 
FALSE) {
     return()
   }
 
+  if ("median" %in% funs_in_expr) {
+    expr <- wrap_median(expr, hash)
+    # TODO: Remove this warning after median() returns an exact median
+    # (ARROW-14021)
+    warn(

Review comment:
       I wonder if it would make more sense to move this warning to 
`agg_funcs$median` since we handle most other "arrow not supported" options in 
those functions. 

##########
File path: r/R/dplyr-functions.R
##########
@@ -847,6 +847,14 @@ agg_funcs$var <- function(x, na.rm = FALSE, ddof = 1) {
     options = list(skip_nulls = na.rm, min_count = 0L, ddof = ddof)
   )
 }
+agg_funcs$median <- function(x, na.rm = FALSE) {

Review comment:
       Can you also add `quantile`, and just error if there is more than one 
quantile specified?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to