[I] [R] `dplyr` `summarize` commands do not accept `na.rm` arguments [arrow]

via GitHub Wed, 10 Jul 2024 06:47:04 -0700


TPDeramus opened a new issue, #43207:
URL: https://github.com/apache/arrow/issues/43207


   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   Hi Arrow devs.
   
   I wanted to ask about something I noticed when using column-wise 
operations (`across()`) with `dplyr` on `arrow` tables.
   
   If I have an Arrow table and want to run a basic function such as `mean`, 
`max`, or `min` inside `summarize`, it appears that `arrow` does not currently 
accept the `na.rm = TRUE` argument, or if it does, I can't find it in the 
documentation.
   
   Say I start with the following dataset:
   
   | Participant | Rating |
   | ----------- | ------ |
   | Donna       | 17     |
   | Donna       | NA     |
   | Greg        | 21     |
   | Greg        | NA     |
   
   For example, with ordinary R data frames, either of these two calls works 
(though the second, which passes `na.rm` through the `...` of `across()`, is 
deprecated in dplyr):
   
   ```
   library(dplyr)
   
   # Anonymous function: the currently recommended form
   data.frame(
     Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
     Rating = c(21, NA, 17, NA)
   ) |>
     group_by(Participant) |>
     summarize(across(matches("Rating"), \(x) max(x, na.rm = TRUE))) |>
     as.data.frame()
   
   # Extra arguments passed through `...`: deprecated in dplyr
   data.frame(
     Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
     Rating = c(21, NA, 17, NA)
   ) |>
     group_by(Participant) |>
     summarize(across(matches("Rating"), max, na.rm = TRUE)) |>
     as.data.frame()
   ```
   Producing:
   | Participant | Rating |
   | ----------- | ------ |
   | Donna       | 17     |
   | Greg        | 21     |
   
   However, when I run the same commands on an Arrow table, both throw errors:
   
   ```
   library(arrow)
   library(dplyr)
   
   data.frame(
     Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
     Rating = c(21, NA, 17, NA)
   ) |>
     as_arrow_table() |>
     group_by(Participant) |>
     summarize(across(matches("Rating"), \(x) max(x, na.rm = TRUE))) |>
     as.data.frame()
   
   Error in `across_setup()`:
   ! Anonymous functions are not yet supported in Arrow
   Run `rlang::last_trace()` to see where the error occurred.
   
   data.frame(
     Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
     Rating = c(21, NA, 17, NA)
   ) |>
     as_arrow_table() |>
     group_by(Participant) |>
     summarize(across(matches("Rating"), max, na.rm = TRUE)) |>
     as.data.frame()
   
   Error in `expand_across()`:
   ! `...` argument to `across()` is deprecated in dplyr and not supported in Arrow
   Run `rlang::last_trace()` to see where the error occurred.
   ```
   The only variant that runs without an error is the one without `na.rm`:
   ```
   data.frame(
     Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
     Rating = c(21, NA, 17, NA)
   ) |>
     as_arrow_table() |>
     group_by(Participant) |>
     summarize(across(matches("Rating"), max)) |>
     as.data.frame()
   ```
   
   It returns `NA` for each participant, which is not what I want:
   | Participant | Rating |
   | ----------- | ------ |
   | Donna       | NA     |
   | Greg        | NA     |
   
   Is there a way to pass `na.rm = TRUE` in this call without having to 
manually drop the `NA` values from each column or row of interest in my data 
first?
   
   ### Component(s)
   
   R

