[GitHub] [arrow] nealrichardson commented on a diff in pull request #14286: ARROW-15838: [R] Coalesce join keys in full outer join

GitBox Mon, 10 Oct 2022 07:32:36 -0700


nealrichardson commented on code in PR #14286:
URL: https://github.com/apache/arrow/pull/14286#discussion_r991354505



##########
r/R/query-engine.R:
##########
@@ -384,3 +397,65 @@ needs_projection <- function(projection, schema) {
     !identical(field_names, names(projection)) || # Any fields are renamed
     !identical(field_names, names(schema)) # The fields are reordered
 }
+
+#' Create projection needed to coalesce join keys after a full outer join
+#'
+#' @examples
+#' test_join <- list(
+#'   type = JoinType$FULL_OUTER,
+#'   right_data = arrow_table(x = 1, y = 2, z = "x"),
+#'   by = c("x", "y"),
+#'   suffix = c(".x", ".y"),
+#'   keep = FALSE
+#' )
+#' post_join_projection(c("value", "x", "y", "z"), test_join)
+#'
+#' @noRd
+post_join_projection <- function(left_names, join_config) {
+  # Need to coalesce the key columns
+  right_names <- names(join_config$right_data)
+  # Collect mapping of which columns on left need to be coalesced with which
+  # column on the right side.
+  coalesce_targets <- data.frame(

Review Comment:
   Rather than a data.frame, would a named list or character vector be a 
simpler map to work with?



##########
r/R/dplyr-collect.R:
##########
@@ -123,15 +123,25 @@ implicit_schema <- function(.data) {
       # Add cols from right side, except for semi/anti joins
       right_cols <- .data$join$right_data$selected_columns
       left_cols <- .data$selected_columns
-      right_fields <- map(
-        right_cols[setdiff(names(right_cols), .data$join$by)],
-        ~ .$type(.data$join$right_data$.data$schema)
-      )
-      # get right table and left table column names excluding the join key
-      right_cols_ex_by <- right_cols[setdiff(names(right_cols), .data$join$by)]
-      left_cols_ex_by <- left_cols[setdiff(names(left_cols), .data$join$by)]
-      # find the common column names in left and right tables
-      common_cols <- intersect(names(right_cols_ex_by), names(left_cols_ex_by))
+
+      # If keep = TRUE, we want to keep the key columns in the RHS. Otherwise,
+      # they will be dropped.
+      if (.data$join$keep) {
+        # find the common column names in left and right tables
+        common_cols <- intersect(names(right_cols), names(left_cols))
+        right_fields <- map(right_cols, ~ 
.$type(.data$join$right_data$.data$schema))
+      } else {
+        right_fields <- map(
+          right_cols[setdiff(names(right_cols), .data$join$by)],
+          ~ .$type(.data$join$right_data$.data$schema)
+        )
+        # get right table and left table column names excluding the join key

Review Comment:
   ```suggestion
           # get right table and left table column projections excluding the 
join key(s)
   ```



##########
r/src/compute-exec.cpp:
##########
@@ -404,37 +403,16 @@ std::shared_ptr<compute::ExecNode> ExecNode_Join(
   for (auto&& name : left_output) {
     left_out_refs.emplace_back(std::move(name));
   }
-  if (type != 0 && type != 2) {
+  // dplyr::semi_join => LEFT_SEMI; dplyr::anti_join => LEFT_ANTI
+  // So ignoring RIGHT_SEMI and RIGHT_ANTI here.

Review Comment:
   ```suggestion
     // So ignoring RIGHT_SEMI and RIGHT_ANTI here because dplyr doesn't 
implement them.
   ```



##########
r/R/query-engine.R:
##########
@@ -165,6 +172,12 @@ ExecPlan <- R6Class("ExecPlan",
             left_suffix = .data$join$suffix[[1]],
             right_suffix = .data$join$suffix[[2]]
           )
+          # If we are doing a full outer join and not keeping the join keys of

Review Comment:
   Would it make sense to move this projection step into 
`full_join.arrow_dplyr_query()`? I could see arguments either way, but it might 
be cleaner to move the logic there so that there is less special handling in 
`ExecPlan$build()`, so that it can more simply just render the query plan it is 
given.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] nealrichardson commented on a diff in pull request #14286: ARROW-15838: [R] Coalesce join keys in full outer join

Reply via email to