TPDeramus opened a new issue, #43207:
URL: https://github.com/apache/arrow/issues/43207
### Describe the usage question you have. Please include as many useful details as possible.
Hi Arrow devs.
I wanted to ask about something I noticed when using column-wise
operations with `dplyr` on `arrow` tables.
If I have an arrow table and want to run a basic function such as
`mean`, `max`, or `min` across columns inside `summarize`, it appears that
`arrow` does not currently accept the `na.rm = TRUE` argument, or if it does,
I can't find how in the documentation.
Say I start with a dataset like this:
| Participant | Rating |
| ------------ | -------- |
| Donna | 17 |
| Donna | NA |
| Greg | 21 |
| Greg | NA |
If this were a regular `R` data frame, either of these two calls would work
(though the second uses a deprecated form):
```
library(dplyr)

# Anonymous function passed to across()
data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), \(x) max(x, na.rm = TRUE))) |>
  as.data.frame()

# Extra arguments passed through `...` of across() (deprecated in dplyr)
data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), max, na.rm = TRUE)) |>
  as.data.frame()
```
Producing:
| Participant | Rating |
| ------------ | -------- |
| Donna | 17 |
| Greg | 21 |
However, when I run the same commands on an arrow table, both throw errors:
```
library(arrow)
library(dplyr)

# Anonymous function passed to across()
data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  as_arrow_table() |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), \(x) max(x, na.rm = TRUE))) |>
  as.data.frame()
#> Error in `across_setup()`:
#> ! Anonymous functions are not yet supported in Arrow
#> Run `rlang::last_trace()` to see where the error occurred.

# Extra arguments passed through `...` of across()
data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  as_arrow_table() |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), max, na.rm = TRUE)) |>
  as.data.frame()
#> Error in `expand_across()`:
#> ! `...` argument to `across()` is deprecated in dplyr and not supported in Arrow
#> Run `rlang::last_trace()` to see where the error occurred.
```
The only variant that runs is the one without `na.rm`:
```
data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  as_arrow_table() |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), max)) |>
  as.data.frame()
```
But it returns `NA` for every group, which is not what I want:
| Participant | Rating |
| ------------ | -------- |
| Donna | NA |
| Greg | NA |
Is there a way to pass the `na.rm = TRUE` argument to this call without
having to manually drop the `NA` values for each column or row of interest I
have in my data?
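For reference, here is a minimal sketch of two forms that might sidestep both errors. Neither is a confirmed answer from the Arrow maintainers. It assumes that the `arrow` bindings for `max()` honour `na.rm = TRUE` when the column is named directly, and that `across()` on arrow tables accepts a purrr-style formula lambda (`~ max(.x, na.rm = TRUE)`) even though it rejects `\(x)` anonymous functions. The object names `ratings`, `explicit`, and `with_across` are made up for illustration.

```r
library(arrow)
library(dplyr)

ratings <- data.frame(
  Participant = c("Greg", "Greg", "Donna", "Donna"),
  Rating = c(21, NA, 17, NA)
)

# Option 1: name the column directly instead of going through across().
explicit <- ratings |>
  as_arrow_table() |>
  group_by(Participant) |>
  summarize(Rating = max(Rating, na.rm = TRUE)) |>
  as.data.frame() |>
  arrange(Participant)  # sort in R; group order from Arrow is not guaranteed

# Option 2 (assumption: formula lambdas are accepted by across() in arrow).
with_across <- ratings |>
  as_arrow_table() |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), ~ max(.x, na.rm = TRUE))) |>
  as.data.frame() |>
  arrange(Participant)

print(explicit)
print(with_across)
```

If the formula form is not available in the installed `arrow` version, Option 1 still avoids `across()` entirely, at the cost of listing each column by hand.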
### Component(s)
R