[PR] WIP [R] Better error handling in dplyr code [arrow]

via GitHub Tue, 07 May 2024 08:35:35 -0700


nealrichardson opened a new pull request, #41576:
URL: https://github.com/apache/arrow/pull/41576


   I started out trying to make it so that `arrow_eval()` could just raise its 
errors, rather than catch them and have every caller inspect and re-raise. I 
kept pulling on this thread and ended up refactoring most of the error 
handling in the dplyr code paths. Summary of changes, from the bottom up:
   
   * We have two wrappers that raise classed errors: `arrow_not_supported()` 
(which previously existed but just called `stop()`) and `validation_error()`. 
They raise `arrow_not_supported` and `validation_error`, respectively. Function 
bindings now raise one or the other, never just stop/abort.
   * `arrow_eval()` modifies the errors raised by function bindings: it inserts 
the expression as the `call` attribute of the error, which lets `rlang` print 
it more cleanly, and it catches any non-classed errors and re-raises them 
as `arrow_not_supported` or `validation_error`, as appropriate.
   * New `try_arrow_dplyr()` wrapper around everything inside (most*) dplyr 
verb implementations, which only calls `abandon_ship()` on 
`arrow_not_supported` errors, and re-raises everything else. For datasets, it 
just adds an additional note to the error message advising you that you can 
call `collect()`. So errors generally bubble up, and each of these wrappers 
adds some context to the message.
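   
   The three layers above can be sketched in base R as follows. This is only 
an illustration of the control flow, not the code in this PR: every 
`sketch_*` name is hypothetical, it uses base conditions rather than rlang, 
and treating unclassed errors as `arrow_not_supported` is an assumption made 
to keep the example short.
   
   ```r
   # Hypothetical stand-ins for illustration; not the arrow package's code.
   
   # Layer 1: helpers that raise classed errors.
   sketch_raise <- function(msg, class) {
     stop(structure(
       list(message = msg, call = NULL),
       class = c(class, "error", "condition")
     ))
   }
   sketch_not_supported <- function(msg) {
     sketch_raise(paste(msg, "not supported in Arrow"), "arrow_not_supported")
   }
   sketch_validation_error <- function(msg) {
     sketch_raise(msg, "validation_error")
   }
   
   # Layer 2: evaluate an expression against the bindings, attach the
   # expression as the error's call, and re-raise.
   sketch_eval <- function(expr, bindings) {
     tryCatch(
       eval(expr, bindings),
       error = function(e) {
         e$call <- expr
         known <- c("arrow_not_supported", "validation_error")
         if (!inherits(e, known)) {
           # Assumption: unclassed errors are treated as "not supported".
           class(e) <- c("arrow_not_supported", class(e))
         }
         stop(e)
       }
     )
   }
   
   # Layer 3: the verb-level wrapper falls back only on
   # arrow_not_supported; every other error propagates untouched.
   sketch_try_verb <- function(expr, fallback) {
     tryCatch(expr, arrow_not_supported = function(e) fallback(e))
   }
   
   # Toy bindings: one rejects invalid input, one is unsupported.
   bindings <- list(
     case_when = function(...) {
       if (length(list(...)) == 0) {
         sketch_validation_error("No cases provided")
       }
     },
     unsupported_fn = function(...) sketch_not_supported("unsupported_fn()")
   )
   
   fallback <- function(e) "fell back to R"
   
   # Unsupported feature: the wrapper takes the fallback path.
   res <- sketch_try_verb(
     sketch_eval(quote(unsupported_fn()), bindings),
     fallback
   )
   
   # Invalid input: the error bubbles up with the expression as its call.
   err <- tryCatch(
     sketch_try_verb(sketch_eval(quote(case_when()), bindings), fallback),
     validation_error = function(e) e
   )
   ```
   
   Because the wrapper names only `arrow_not_supported` in its handler list, 
a `validation_error` passes straight through it, so no per-caller inspect and 
re-raise logic is needed.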
   
   The ultimate results of all of this:
   
   * We no longer tell people to `collect()` (or, for in-memory data, just do 
it) in cases where it would also fail in regular dplyr because the input is 
invalid. 
   * Nicer error printing across the board, using rlang/cli for formatting, and 
cleaner calls and tracebacks. No more `Error: Error :` messages.
   * Adds the ability to provide helpful suggestions in error messages in 
bindings, for cases where there is an alternative available other than just 
`collect()`.
   * For us, it should be easier to work with `arrow_eval()` and the dplyr 
verbs in general. There's less bookkeeping you have to do to catch and rethrow 
errors, and it's consistent across the various parts of the evaluation (i.e. 
the same thing works inside the dplyr verbs as in the bindings).
   
   Some concrete examples:
   
   1. Invalid input in a binding. Retrying with dplyr won't help, so don't 
automatically do it (if Table) or suggest it (if Dataset).
   
   ```r
   # Before: 
   mtcars |> 
     arrow_table() |> 
     transmute(case_when())
   #> Warning: Expression case_when() not supported in Arrow; pulling data into R
   #> Error:
   #> ℹ In argument: `case_when()`.
   #> Caused by error in `case_when()`:
   #> ! At least one condition must be supplied.
   
   # After:
   mtcars |>
     arrow_table() |>
     transmute(case_when())
   #> Error in `case_when()`:
   #> ! No cases provided
   ```
   
   2. Dealing with unsupported features outside of the bindings. This example 
is something that is checked in `summarize()` but not caught inside 
`arrow_eval()` because it's not about the expressions.
   
   ```r
   # Before:
   mtcars |> 
     InMemoryDataset$create() |> 
     group_by(cyl) |> 
     summarize(mean(hp), .groups = "rowwise")
   #> Error: Error : .groups = "rowwise" not supported in Arrow
   #> Call collect() first to pull data into R.
   
   # After:
   mtcars |>
     InMemoryDataset$create() |> 
     group_by(cyl) |> 
     summarize(mean(hp), .groups = "rowwise")
   #> Error in `summarise.arrow_dplyr_query()`:
   #> ! .groups = "rowwise" not supported in Arrow
   #> → Call collect() first to pull data into R.
   ```
   
   3. more examples to come


