jonkeane commented on a change in pull request #12179:
URL: https://github.com/apache/arrow/pull/12179#discussion_r787102302
##########
File path: r/R/dplyr-join.R
##########
@@ -117,10 +117,33 @@ handle_join_by <- function(by, x, y) {
if (is.null(names(by))) {
by <- set_names(by)
}
- # TODO: nicer messages?
- stopifnot(
- all(names(by) %in% names(x)),
- all(by %in% names(y))
- )
+
+ missing_x_cols <- setdiff(names(by), names(x))
+ if (length(missing_x_cols) > 0) {
+ message <- paste(
+ "Join",
+ ngettext(length(missing_x_cols), "column", "columns"),
+ "must be present in data."
+ )
+ message_x <- paste(
+ oxford_paste(missing_x_cols, quote_symbol = "`"),
+ "not present in x."
+ )
+ abort(c(message, x = message_x))
+ }
+
+ missing_y_cols <- setdiff(by, names(y))
+ if (length(missing_y_cols) > 0) {
+ message <- paste(
+ "Join",
+ ngettext(length(missing_y_cols), "column", "columns"),
+ "must be present in data."
+ )
+ message_y <- paste(
+ oxford_paste(missing_y_cols, quote_symbol = "`"),
+ "not present in y."
+ )
+ abort(c(message, x = message_y))
+ }
Review comment:
Would it be possible to accumulate message_x and message_y in parallel
and then print them both? This might ultimately simplify this (e.g. the "join
column must be present in data" would happen once). But more importantly it
would give people the feedback about both x and y at the beginning instead of
fixing one and then having another pop up.
##########
File path: r/tests/testthat/test-dplyr-join.R
##########
@@ -90,9 +90,57 @@ test_that("Error handling", {
left_tab %>%
left_join(to_join, by = "not_a_col") %>%
collect(),
- "all(names(by) %in% names(x)) is not TRUE",
- fixed = TRUE
+ "Join column must be present in data"
)
+ expect_snapshot({
+ (expect_error(
Review comment:
The extra wrapping of `()`s is to get the error from a whole dplyr
chain, right? (at least, that's what I think
https://github.com/r-lib/testthat/issues/1471 is saying).
In these cases, there's not actually a chain, so these should just work
(right?)
##########
File path: r/tests/testthat/_snaps/dplyr-join.md
##########
@@ -0,0 +1,50 @@
+# Error handling
+
+ Code
+ (expect_error(left_join(arrow_table(example_data),
arrow_table(example_data),
+ by = "made_up_colname")))
+ Output
+ <error/rlang_error>
+ Join column must be present in data.
+ x `made_up_colname` not present in x.
Review comment:
The error messages including x/y is a nice touch, though it should be
pointed out that does differ from dplyr:
```
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
left_join(mtcars, mtcars, by = "foo")
#> Error: Join columns must be present in data.
#> ✖ Problem with `foo`.
```
Giving the explicit x/y is friendlier though, so I'm happy to keep it.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]