jonkeane commented on a change in pull request #12179:
URL: https://github.com/apache/arrow/pull/12179#discussion_r788171888



##########
File path: r/R/dplyr-join.R
##########
@@ -117,10 +117,33 @@ handle_join_by <- function(by, x, y) {
   if (is.null(names(by))) {
     by <- set_names(by)
   }
-  # TODO: nicer messages?
-  stopifnot(
-    all(names(by) %in% names(x)),
-    all(by %in% names(y))
-  )
+
+  missing_x_cols <- setdiff(names(by), names(x))
+  if (length(missing_x_cols) > 0) {
+    message <- paste(
+      "Join",
+      ngettext(length(missing_x_cols), "column", "columns"),
+      "must be present in data."
+    )
+    message_x <- paste(
+      oxford_paste(missing_x_cols, quote_symbol = "`"),
+      "not present in x."
+      )
+    abort(c(message, x = message_x))
+  }
+
+  missing_y_cols <- setdiff(by, names(y))
+  if (length(missing_y_cols) > 0) {
+    message <- paste(
+      "Join",
+      ngettext(length(missing_y_cols), "column", "columns"),
+      "must be present in data."
+    )
+    message_y <- paste(
+      oxford_paste(missing_y_cols, quote_symbol = "`"),
+      "not present in y."
+    )
+    abort(c(message, x = message_y))
+  }

Review comment:
       Nice! I think this code is also even cleaner than the first version (and 
has a better behavior!)

##########
File path: r/tests/testthat/_snaps/dplyr-join.md
##########
@@ -0,0 +1,46 @@
+# Error handling
+
+    Code
+      left_join(arrow_table(example_data), arrow_table(example_data), by = 
"made_up_colname")
+    Error <rlang_error>
+      Join columns must be present in data.
+      x `made_up_colname` not present in x.
+      x `made_up_colname` not present in y.
+
+---
+
+    Code
+      left_join(arrow_table(example_data), arrow_table(example_data), by = 
c(int = "made_up_colname"))
+    Error <rlang_error>
+      Join columns must be present in data.
+      x `made_up_colname` not present in y.
+
+---
+
+    Code
+      left_join(arrow_table(example_data), arrow_table(example_data), by = c(
+        made_up_colname = "int"))
+    Error <rlang_error>
+      Join columns must be present in data.
+      x `made_up_colname` not present in x.
+
+---
+
+    Code
+      left_join(arrow_table(example_data), arrow_table(example_data), by = c(
+        "made_up_colname1", "made_up_colname2"))
+    Error <rlang_error>
+      Join columns must be present in data.
+      x `made_up_colname1` and `made_up_colname2` not present in x.
+      x `made_up_colname1` and `made_up_colname2` not present in y.
+
+---
+
+    Code
+      left_join(arrow_table(example_data), arrow_table(example_data), by = c(
+        made_up_colname1 = "made_up_colname2"))
+    Error <rlang_error>
+      Join columns must be present in data.
+      x `made_up_colname1` not present in x.
+      x `made_up_colname2` not present in y.

Review comment:
       Oh this is super nice + clear and someone won't be left fixing 
`made_up_colname1` and then getting hit with `made_up_colname2`

##########
File path: r/tests/testthat/test-dplyr-join.R
##########
@@ -90,8 +90,56 @@ test_that("Error handling", {
     left_tab %>%
       left_join(to_join, by = "not_a_col") %>%
       collect(),
-    "all(names(by) %in% names(x)) is not TRUE",
-    fixed = TRUE
+    "Join columns must be present in data"
+  )
+
+  # we print both message_x and message_y with an unnamed `by` vector

Review comment:
       I like the comments here that explain _why_ these tests are here / what 
is being checked — though part of me wonders if this is showing the downside to 
snapshots (not having the error here in the code showing what we are looking). 
   
   Let's keep the snapshots for now (especially cause greping the 
`rlang::abort()` messages is not super fun). Is there a Jira already for adding 
something to our devdocs about how to use snapshots/our philosophy on how to 
review them/limitations (e.g. C++ tests not working well)/that kind of stuff? 
We shouldn't expand the scope of this PR for just that, but it would be good to 
make a jira that adds that info so we all agree how we will use them going 
forward.

##########
File path: r/tests/testthat/test-dplyr-join.R
##########
@@ -90,9 +90,57 @@ test_that("Error handling", {
     left_tab %>%
       left_join(to_join, by = "not_a_col") %>%
       collect(),
-    "all(names(by) %in% names(x)) is not TRUE",
-    fixed = TRUE
+    "Join column must be present in data"
   )
+  expect_snapshot({
+    (expect_error(

Review comment:
       Yup, I like `expect_snapshot(error = TRUE)` and it looks like that's the 
way that the testthat developers are recommending moving forward too.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to