nealrichardson opened a new pull request, #41576:
URL: https://github.com/apache/arrow/pull/41576
I started out trying to make it so that `arrow_eval()` could just raise its
errors, rather than catch them and have every caller inspect and re-raise. I
kept pulling on that thread and ended up refactoring most of the error
handling in the dplyr code paths. Summary of changes, from the bottom up:
* We have two wrappers that raise classed errors: `arrow_not_supported()`
(which previously existed but just called `stop()`) and `validation_error()`.
They raise `arrow_not_supported` and `validation_error`, respectively. Function
bindings now raise one or the other, never just stop/abort.
* `arrow_eval()` modifies the errors raised by function bindings, inserting
the expression as the `call` attribute of the error, which lets `rlang` print
them more cleanly. It also catches any non-classed errors and re-raises them
as `arrow_not_supported` or `validation_error`, as appropriate.
* New `try_arrow_dplyr()` wrapper around everything inside (most*) dplyr
verb implementations, which only calls `abandon_ship()` on
`arrow_not_supported` errors, and re-raises everything else. For datasets, it
just adds an additional note to the error message advising you that you can
call `collect()`. So errors generally bubble up, and each of these wrappers
adds some context to the message.
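To make the layering above concrete, here is a minimal base-R sketch of the same idea. Everything prefixed `sketch_`, plus `raise_classed()` and the two toy bindings, is a hypothetical stand-in and not the code in this PR (which uses rlang). In particular, mapping every unclassed error to `arrow_not_supported` is a simplifying assumption of the sketch.

```r
# Hypothetical stand-ins for illustration only; not the arrow package's code.
raise_classed <- function(msg, class, call = NULL) {
  cond <- structure(
    class = c(class, "error", "condition"),
    list(message = msg, call = call)
  )
  stop(cond)
}

sketch_not_supported <- function(what) {
  raise_classed(paste(what, "not supported in Arrow"), "arrow_not_supported")
}

sketch_validation_error <- function(msg) {
  raise_classed(msg, "validation_error")
}

# Evaluate a quoted expression. Classed errors get the expression attached as
# their call and are re-raised; unclassed errors are re-raised as classed ones
# (assumption: always as "arrow_not_supported").
sketch_eval <- function(expr, env) {
  tryCatch(
    eval(expr, env),
    arrow_not_supported = function(e) {
      e$call <- expr
      stop(e)
    },
    validation_error = function(e) {
      e$call <- expr
      stop(e)
    },
    error = function(e) {
      raise_classed(conditionMessage(e), "arrow_not_supported", call = expr)
    }
  )
}

# Verb-level wrapper: fall back only when the feature is unsupported.
# Any other error (e.g. validation_error) has no handler here and bubbles up.
sketch_try_verb <- function(expr, fallback) {
  tryCatch(expr, arrow_not_supported = function(e) fallback(e))
}

# Two toy "bindings", one per error class.
env <- list2env(list(
  needs_input = function(...) {
    if (...length() == 0) sketch_validation_error("No cases provided")
    TRUE
  },
  unsupported = function(...) sketch_not_supported("unsupported()")
), parent = baseenv())

run <- function(expr) {
  sketch_try_verb(
    sketch_eval(expr, env),
    fallback = function(e) "fell back to R"
  )
}

# Unsupported feature: the wrapper takes the fallback path.
res_unsupported <- run(quote(unsupported()))

# Invalid input: the error passes through the wrapper untouched.
res_invalid <- tryCatch(
  run(quote(needs_input())),
  validation_error = function(e) e
)
```

In this sketch the verb wrapper never inspects messages; it dispatches purely on the condition class, so invalid input reaches the user with the offending expression as its call instead of triggering a fallback.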
The ultimate results of all of this:
* We no longer tell people to `collect()` (or, if on in-memory data, just do
it) in cases where it would also fail in regular dplyr because the input is
invalid.
* Nicer error printing across the board, using rlang/cli for formatting, and
cleaner calls and tracebacks. No more `Error: Error :` messages.
* Adds the ability to provide helpful suggestions in error messages in
bindings, for cases where there is an alternative available other than just
`collect()`.
* For us, it should be easier to work with `arrow_eval()` and the dplyr
verbs in general. There's less bookkeeping you have to do to catch and rethrow
errors, and it's consistent across the various parts of the evaluation (i.e.
the same thing works inside the dplyr verbs as in the bindings).
Some concrete examples:
1. Invalid input in a binding. Retrying with dplyr won't help, so don't
automatically do it (if Table) or suggest it (if Dataset).
```r
# Before:
mtcars |>
arrow_table() |>
transmute(case_when())
#> Warning: Expression case_when() not supported in Arrow; pulling data into R
#> Error:
#> ℹ In argument: `case_when()`.
#> Caused by error in `case_when()`:
#> ! At least one condition must be supplied.
# After:
mtcars |>
arrow_table() |>
transmute(case_when())
#> Error in `case_when()`:
#> ! No cases provided
```
2. Dealing with unsupported features outside of the bindings. This example
is something that is checked in `summarize()` but not caught inside
`arrow_eval()` because it's not about the expressions.
```r
# Before:
mtcars |>
InMemoryDataset$create() |>
group_by(cyl) |>
summarize(mean(hp), .groups = "rowwise")
#> Error: Error : .groups = "rowwise" not supported in Arrow
#> Call collect() first to pull data into R.
# After:
mtcars |>
InMemoryDataset$create() |>
group_by(cyl) |>
summarize(mean(hp), .groups = "rowwise")
#> Error in `summarise.arrow_dplyr_query()`:
#> ! .groups = "rowwise" not supported in Arrow
#> → Call collect() first to pull data into R.
```
3. more examples to come