jonkeane commented on a change in pull request #12179:
URL: https://github.com/apache/arrow/pull/12179#discussion_r787102302



##########
File path: r/R/dplyr-join.R
##########
@@ -117,10 +117,33 @@ handle_join_by <- function(by, x, y) {
   if (is.null(names(by))) {
     by <- set_names(by)
   }
-  # TODO: nicer messages?
-  stopifnot(
-    all(names(by) %in% names(x)),
-    all(by %in% names(y))
-  )
+
+  missing_x_cols <- setdiff(names(by), names(x))
+  if (length(missing_x_cols) > 0) {
+    message <- paste(
+      "Join",
+      ngettext(length(missing_x_cols), "column", "columns"),
+      "must be present in data."
+    )
+    message_x <- paste(
+      oxford_paste(missing_x_cols, quote_symbol = "`"),
+      "not present in x."
+      )
+    abort(c(message, x = message_x))
+  }
+
+  missing_y_cols <- setdiff(by, names(y))
+  if (length(missing_y_cols) > 0) {
+    message <- paste(
+      "Join",
+      ngettext(length(missing_y_cols), "column", "columns"),
+      "must be present in data."
+    )
+    message_y <- paste(
+      oxford_paste(missing_y_cols, quote_symbol = "`"),
+      "not present in y."
+    )
+    abort(c(message, x = message_y))
+  }

Review comment:
       Would it be possible to accumulate message_x and message_y in parallel 
and then print them both? This might ultimately simplify this (e.g. the "join 
column must be present in data" would happen once). But more importantly it 
would give people the feedback about both x and y at the beginning instead of 
fixing one and then having another pop up.
   
   

##########
File path: r/tests/testthat/test-dplyr-join.R
##########
@@ -90,9 +90,57 @@ test_that("Error handling", {
     left_tab %>%
       left_join(to_join, by = "not_a_col") %>%
       collect(),
-    "all(names(by) %in% names(x)) is not TRUE",
-    fixed = TRUE
+    "Join column must be present in data"
   )
+  expect_snapshot({
+    (expect_error(

Review comment:
       The extra wrapping of `()`s is to get the error from a whole dplyr 
chain, right? (at least, that's what I think 
https://github.com/r-lib/testthat/issues/1471 is saying). 
   
   In these cases, there's not actually a chain, so these should just work 
(right?)

##########
File path: r/tests/testthat/_snaps/dplyr-join.md
##########
@@ -0,0 +1,50 @@
+# Error handling
+
+    Code
+      (expect_error(left_join(arrow_table(example_data), 
arrow_table(example_data),
+      by = "made_up_colname")))
+    Output
+      <error/rlang_error>
+      Join column must be present in data.
+      x `made_up_colname` not present in x.

Review comment:
       The error messages including x/y is a nice touch, though it should be 
pointed out that does differ from dplyr:
   
   ```
   library(dplyr)
   #> 
   #> Attaching package: 'dplyr'
   #> The following objects are masked from 'package:stats':
   #> 
   #>     filter, lag
   #> The following objects are masked from 'package:base':
   #> 
   #>     intersect, setdiff, setequal, union
   left_join(mtcars, mtcars, by = "foo")
   #> Error: Join columns must be present in data.
   #> ✖ Problem with `foo`.
   ```
   
   Giving the explicit x/y is friendlier though, so I'm happy to keep it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to