Copilot commented on code in PR #814:
URL: https://github.com/apache/sedona-db/pull/814#discussion_r3191140525


##########
r/sedonadb/R/join-expression.R:
##########
@@ -466,10 +690,45 @@ sd_build_default_select <- function(
   x_names <- names(join_expr_ctx$x_schema$children)
   y_names <- names(join_expr_ctx$y_schema$children)
 
-  # Extract equijoin key pairs (simple x$col == y$col conditions)
-  # and remove them from y_names. We do this even for right joins to match
+  # Extract simple key pairs (x$col == y$col or some_fun(x$col, y$col) 
conditions)
+  # if keep is not TRUE.
+  if (isTRUE(keep)) {
+    simple_join_keys <- list(x_cols = character(), y_cols = character(), op = 
character())
+  } else {
+    simple_join_keys <- sd_extract_simple_join_keys(join_conditions)
+  }

Review Comment:
   The new `keep` argument is never validated. Because this branch uses 
`isTRUE()`, values like `\"yes\"`, `1`, `NA`, or `c(TRUE, FALSE)` are silently 
treated as the default behavior instead of raising an error. Please validate 
`keep` as a length-1 logical so caller mistakes don't quietly change the join 
schema.



##########
r/sedonadb/R/dataframe-join.R:
##########
@@ -0,0 +1,170 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+#' Join two SedonaDB DataFrames
+#'
+#' Perform a join operation between two dataframes. Use [sd_join_by()] to
+#' specify join conditions using `x$column` and `y$column` syntax to
+#' reference columns from the left and right tables respectively.
+#'
+#' @param x The left dataframe
+#' @param y The right dataframe
+#' @param by Join specification. One of:
+#'   - A `sedonadb_join_by` object from [sd_join_by()]
+#'   - A `sedonadb_join_by` object from [sd_join_intersects()],
+#'     [sd_join_contains()], [sd_join_within()], [sd_join_covers()],
+#'     [sd_join_coveredby()], [sd_join_touches()], [sd_join_crosses()],
+#'     [sd_join_overlaps()], or [sd_join_equals()].
+#'   - A character vector of column names to join on in both tables
+#'   - A named character vector mapping left-table column names to
+#'     right-table column names, e.g. `c(x_val = "y_val")`
+#'   - `NULL` for a natural join on columns with matching names
+#' @param join_type The type of join to perform. One of "inner", "left", 
"right",
+#'   "full", "leftsemi", "rightsemi", "leftanti", "rightanti", "leftmark",
+#'   or "rightmark".
+#' @param select Post-join column selection. One of
+#'   - `NULL` for no modification, which may result in duplicate (unqualified)
+#'     column names. The column may still be
+#'     referred to with a qualifier in advanced usage using [sd_expr_column()].
+#'   - [sd_join_select_default()] for dplyr-like behaviour (equi-join keys
+#'     removed, intersecting names suffixed)

Review Comment:
   The default-select behavior is broader than this now documents. 
`sd_build_default_select()` also drops the right-hand geometry/key column for 
left and inner spatial predicate joins such as `sd_join_intersects()`, so 
callers will see a schema change even when the join is not an equijoin. Please 
document that spatial predicate joins are also affected here.
   



##########
r/sedonadb/R/dataframe-join.R:
##########
@@ -0,0 +1,170 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+#' Join two SedonaDB DataFrames
+#'
+#' Perform a join operation between two dataframes. Use [sd_join_by()] to
+#' specify join conditions using `x$column` and `y$column` syntax to
+#' reference columns from the left and right tables respectively.
+#'
+#' @param x The left dataframe
+#' @param y The right dataframe
+#' @param by Join specification. One of:
+#'   - A `sedonadb_join_by` object from [sd_join_by()]
+#'   - A `sedonadb_join_by` object from [sd_join_intersects()],
+#'     [sd_join_contains()], [sd_join_within()], [sd_join_covers()],
+#'     [sd_join_coveredby()], [sd_join_touches()], [sd_join_crosses()],
+#'     [sd_join_overlaps()], or [sd_join_equals()].
+#'   - A character vector of column names to join on in both tables
+#'   - A named character vector mapping left-table column names to
+#'     right-table column names, e.g. `c(x_val = "y_val")`
+#'   - `NULL` for a natural join on columns with matching names
+#' @param join_type The type of join to perform. One of "inner", "left", 
"right",
+#'   "full", "leftsemi", "rightsemi", "leftanti", "rightanti", "leftmark",
+#'   or "rightmark".
+#' @param select Post-join column selection. One of
+#'   - `NULL` for no modification, which may result in duplicate (unqualified)
+#'     column names. The column may still be
+#'     referred to with a qualifier in advanced usage using [sd_expr_column()].
+#'   - [sd_join_select_default()] for dplyr-like behaviour (equi-join keys
+#'     removed, intersecting names suffixed)
+#'   - [sd_join_select()] for a custom selection
+#' @param keep Use `TRUE` to keep all key column in an equijoin or spatial 
join.
+#'   This is only applied when using [sd_join_select_default()].
+#'
+#' @returns An object of class sedonadb_dataframe
+#' @export
+#'
+#' @examples
+#' df1 <- data.frame(x = letters[1:10], y = 1:10)
+#' df2 <- data.frame(y = 10:1, z = LETTERS[1:10])
+#' df1 |> sd_join(df2)
+#'
+sd_join <- function(
+  x,
+  y,
+  by = NULL,
+  join_type = "inner",
+  select = sd_join_select_default(),
+  keep = NULL
+) {
+  x <- as_sedonadb_dataframe(x)
+  y <- as_sedonadb_dataframe(y, ctx = x$ctx)
+
+  x_schema <- infer_nanoarrow_schema(x)
+  y_schema <- infer_nanoarrow_schema(y)
+  join_expr_ctx <- sd_join_expr_ctx(x_schema, y_schema, ctx = x$ctx)
+  join_conditions <- sd_build_join_conditions(join_expr_ctx, by, ctx = x$ctx)
+
+  df <- x$df$join(y$df, join_conditions, join_type, left_alias = "x", 
right_alias = "y")
+  out <- new_sedonadb_dataframe(x$ctx, df)
+
+  # Apply post-join column selection if needed
+  if (is.null(select)) {
+    projection <- NULL
+  } else if (inherits(select, "sedonadb_join_select_default")) {
+    # Default select: remove duplicate equijoin keys, apply suffixes
+    projection <- sd_build_default_select(
+      join_expr_ctx,
+      join_conditions,
+      select$suffix,
+      join_type,
+      keep = keep
+    )
+  } else if (inherits(select, "sedonadb_join_select")) {
+    # Custom select: evaluate user expressions
+    projection <- sd_eval_join_select_exprs(select, join_expr_ctx)
+  } else {
+    stop(
+      "`select` must be NULL, sd_join_select_default(), or sd_join_select()",
+      call. = FALSE
+    )
+  }
+
+  # NULL return from these functions means that no extra projecting is needed
+  if (is.null(projection)) {
+    out
+  } else {
+    sd_transmute(out, !!!projection)
+  }
+}
+
+#' @rdname sd_join
+#' @export
+sd_left_join <- function(
+  x,
+  y,
+  by = NULL,
+  select = sd_join_select_default(),
+  keep = NULL
+) {
+  sd_join(x, y, by = by, select = select, join_type = "left", keep = keep)
+}
+
+#' @rdname sd_join
+#' @export
+sd_right_join <- function(
+  x,
+  y,
+  by = NULL,
+  select = sd_join_select_default(),
+  keep = NULL
+) {
+  sd_join(x, y, by = by, select = select, join_type = "right", keep = keep)
+}
+
+#' @rdname sd_join
+#' @export
+sd_inner_join <- function(
+  x,
+  y,
+  by = NULL,
+  select = sd_join_select_default(),
+  keep = NULL
+) {
+  sd_join(x, y, by = by, select = select, join_type = "inner", keep = keep)
+}
+
+#' @rdname sd_join
+#' @export
+sd_full_join <- function(
+  x,
+  y,
+  by = NULL,
+  select = sd_join_select_default(),
+  keep = NULL
+) {
+  sd_join(x, y, by = by, select = select, join_type = "full", keep = keep)
+}
+
+#' @rdname sd_join
+#' @export
+sd_semi_join <- function(x, y, by = NULL) {
+  sd_join(x, y, by = by, join_type = "leftsemi")
+}
+
+#' @rdname sd_join
+#' @export
+sd_anti_join <- function(x, y, by = NULL) {
+  sd_join(x, y, by = by, join_type = "leftanti")
+}
+
+#' @rdname sd_join
+#' @export
+sd_cross_join <- function(x, y, select = sd_join_select_default()) {
+  sd_join(x, y, by = character(), select = select, join_type = "inner")
+}

Review Comment:
   This new public wrapper is not exercised in the added test files, unlike the 
other join-type helpers. Please add a smoke test for cartesian row counts and 
default suffixing so regressions in the `by = character()` implementation are 
caught.



##########
r/sedonadb/R/join-expression.R:
##########
@@ -590,5 +851,7 @@ sd_extract_equijoin_keys <- function(join_conditions) {
     )
   }
 
-  list(x_cols = x_cols, y_cols = y_cols)
+  list(x_cols = x_cols, y_cols = y_cols, op = ops)
 }
+
+all_spatial_predicates <- c("st_intersects", "st_contains", "st_within", 
"st_equals")

Review Comment:
   This introduces a second hard-coded spatial-predicate list that is already 
out of sync with the helpers and `equijoin_ops` above (`st_covers`, 
`st_coveredby`, `st_touches`, `st_crosses`, and `st_overlaps` are missing), and 
it's unused in this patch. Keeping an incomplete duplicate list makes future 
behavior drift much more likely.
   



##########
r/sedonadb/R/dataframe-join.R:
##########
@@ -0,0 +1,170 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+#' Join two SedonaDB DataFrames
+#'
+#' Perform a join operation between two dataframes. Use [sd_join_by()] to
+#' specify join conditions using `x$column` and `y$column` syntax to
+#' reference columns from the left and right tables respectively.
+#'
+#' @param x The left dataframe
+#' @param y The right dataframe

Review Comment:
   The old documentation noted that `y` is coerced into the same context as 
`x`, but that behavior is still present in the implementation and is no longer 
described here. Restoring that note will make cross-context joins less 
surprising for API consumers.
   



##########
r/sedonadb/R/dataframe-join.R:
##########
@@ -0,0 +1,170 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+#' Join two SedonaDB DataFrames
+#'
+#' Perform a join operation between two dataframes. Use [sd_join_by()] to
+#' specify join conditions using `x$column` and `y$column` syntax to
+#' reference columns from the left and right tables respectively.
+#'
+#' @param x The left dataframe
+#' @param y The right dataframe
+#' @param by Join specification. One of:
+#'   - A `sedonadb_join_by` object from [sd_join_by()]
+#'   - A `sedonadb_join_by` object from [sd_join_intersects()],
+#'     [sd_join_contains()], [sd_join_within()], [sd_join_covers()],
+#'     [sd_join_coveredby()], [sd_join_touches()], [sd_join_crosses()],
+#'     [sd_join_overlaps()], or [sd_join_equals()].
+#'   - A character vector of column names to join on in both tables
+#'   - A named character vector mapping left-table column names to
+#'     right-table column names, e.g. `c(x_val = "y_val")`
+#'   - `NULL` for a natural join on columns with matching names
+#' @param join_type The type of join to perform. One of "inner", "left", 
"right",
+#'   "full", "leftsemi", "rightsemi", "leftanti", "rightanti", "leftmark",
+#'   or "rightmark".
+#' @param select Post-join column selection. One of
+#'   - `NULL` for no modification, which may result in duplicate (unqualified)
+#'     column names. The column may still be
+#'     referred to with a qualifier in advanced usage using [sd_expr_column()].
+#'   - [sd_join_select_default()] for dplyr-like behaviour (equi-join keys
+#'     removed, intersecting names suffixed)
+#'   - [sd_join_select()] for a custom selection
+#' @param keep Use `TRUE` to keep all key column in an equijoin or spatial 
join.
+#'   This is only applied when using [sd_join_select_default()].
+#'
+#' @returns An object of class sedonadb_dataframe
+#' @export
+#'
+#' @examples
+#' df1 <- data.frame(x = letters[1:10], y = 1:10)
+#' df2 <- data.frame(y = 10:1, z = LETTERS[1:10])
+#' df1 |> sd_join(df2)
+#'
+sd_join <- function(
+  x,
+  y,
+  by = NULL,
+  join_type = "inner",
+  select = sd_join_select_default(),
+  keep = NULL
+) {
+  x <- as_sedonadb_dataframe(x)
+  y <- as_sedonadb_dataframe(y, ctx = x$ctx)

Review Comment:
   The old documentation noted that `y` is coerced into the same context as 
`x`, but that behavior is still present in the implementation and is no longer 
described here. Restoring that note will make cross-context joins less 
surprising for API consumers.



##########
r/sedonadb/R/dataframe-join.R:
##########
@@ -0,0 +1,170 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+#' Join two SedonaDB DataFrames
+#'
+#' Perform a join operation between two dataframes. Use [sd_join_by()] to
+#' specify join conditions using `x$column` and `y$column` syntax to
+#' reference columns from the left and right tables respectively.
+#'
+#' @param x The left dataframe
+#' @param y The right dataframe
+#' @param by Join specification. One of:
+#'   - A `sedonadb_join_by` object from [sd_join_by()]
+#'   - A `sedonadb_join_by` object from [sd_join_intersects()],
+#'     [sd_join_contains()], [sd_join_within()], [sd_join_covers()],
+#'     [sd_join_coveredby()], [sd_join_touches()], [sd_join_crosses()],
+#'     [sd_join_overlaps()], or [sd_join_equals()].
+#'   - A character vector of column names to join on in both tables
+#'   - A named character vector mapping left-table column names to
+#'     right-table column names, e.g. `c(x_val = "y_val")`
+#'   - `NULL` for a natural join on columns with matching names
+#' @param join_type The type of join to perform. One of "inner", "left", 
"right",
+#'   "full", "leftsemi", "rightsemi", "leftanti", "rightanti", "leftmark",
+#'   or "rightmark".
+#' @param select Post-join column selection. One of
+#'   - `NULL` for no modification, which may result in duplicate (unqualified)
+#'     column names. The column may still be
+#'     referred to with a qualifier in advanced usage using [sd_expr_column()].
+#'   - [sd_join_select_default()] for dplyr-like behaviour (equi-join keys
+#'     removed, intersecting names suffixed)
+#'   - [sd_join_select()] for a custom selection
+#' @param keep Use `TRUE` to keep all key column in an equijoin or spatial 
join.
+#'   This is only applied when using [sd_join_select_default()].

Review Comment:
   This wording implies that `keep` changes all spatial joins, but the new 
implementation only uses spatial predicates as removable keys for left/inner 
joins; right/full spatial joins already keep both geometry columns even when 
`keep` is `NULL`. Please narrow the description to the cases where `keep` 
actually changes output, and fix the grammar in 'all key column'.
   



##########
r/sedonadb/R/join-expression.R:
##########
@@ -57,6 +73,60 @@ sd_join_by <- function(...) {
   )
 }
 
+#' @rdname sd_join_by
+#' @export
+sd_join_intersects <- function() {
+  sd_join_by(.fns$st_intersects(.tables$x$geom(), .tables$y$geom()))
+}

Review Comment:
   These shorthand spatial join helpers hard-code `.tables$x$geom()` / 
`.tables$y$geom()`, so they only work when each side has exactly one geometry 
column. The public docs currently present them as general shorthands, but they 
should explicitly call out this single-geometry-column requirement.



##########
r/sedonadb/R/join-expression.R:
##########
@@ -109,21 +190,39 @@ sd_join_expr_ctx <- function(
   names(y_cols) <- y_names
 
   # Create table reference objects that support `$` access
-  x_ref <- structure(x_cols, class = "sedonadb_table_ref", qualifier = 
x_qualifier)
-  y_ref <- structure(y_cols, class = "sedonadb_table_ref", qualifier = 
y_qualifier)
+  x_ref <- structure(
+    x_cols,
+    class = "sedonadb_table_ref",
+    qualifier = x_qualifier,
+    schema = x_schema
+  )
+  y_ref <- structure(
+    y_cols,
+    class = "sedonadb_table_ref",
+    qualifier = y_qualifier,
+    schema = y_schema
+  )
 
   # Also include unqualified column references for unambiguous columns
   ambiguous <- intersect(x_names, y_names)
+
+  # Create data mask with unambiguous columns
   data <- c(x_cols[setdiff(x_names, ambiguous)], y_cols[setdiff(y_names, 
ambiguous)])
+  data_mask <- rlang::as_data_mask(data)
+
+  # Install .tables pronoun for accessing table references programmatically
+  # (e.g., .tables$x$geom, .tables$y$geom)
+  table_refs <- list(x = x_ref, y = y_ref)
+  rlang::env_bind(data_mask, .tables = rlang::as_data_pronoun(table_refs))

Review Comment:
   The new `.tables` binding is only exercised in join-condition tests, but the 
docs also advertise it for `sd_join_select()` expressions. Adding a 
select-expression test that uses `.tables` would make sure this binding keeps 
working in both evaluation paths.



##########
r/sedonadb/R/join-expression.R:
##########
@@ -57,6 +73,60 @@ sd_join_by <- function(...) {
   )
 }
 
+#' @rdname sd_join_by
+#' @export
+sd_join_intersects <- function() {
+  sd_join_by(.fns$st_intersects(.tables$x$geom(), .tables$y$geom()))
+}
+
+#' @rdname sd_join_by
+#' @export
+sd_join_contains <- function() {
+  sd_join_by(.fns$st_contains(.tables$x$geom(), .tables$y$geom()))
+}
+
+#' @rdname sd_join_by
+#' @export
+sd_join_within <- function() {
+  sd_join_by(.fns$st_within(.tables$x$geom(), .tables$y$geom()))
+}
+
+#' @rdname sd_join_by
+#' @export
+sd_join_covers <- function() {
+  sd_join_by(.fns$st_covers(.tables$x$geom(), .tables$y$geom()))
+}
+
+#' @rdname sd_join_by
+#' @export
+sd_join_coveredby <- function() {
+  sd_join_by(.fns$st_coveredby(.tables$x$geom(), .tables$y$geom()))

Review Comment:
   The added tests exercise `sd_join_intersects()`, but the other newly 
exported helper constructors in this block (`sd_join_contains()`, 
`sd_join_within()`, `sd_join_covers()`, `sd_join_coveredby()`, 
`sd_join_touches()`, `sd_join_crosses()`, `sd_join_overlaps()`, and 
`sd_join_equals()`) are not covered. A small parameterized test that each 
helper expands to the expected predicate name would catch typo-level 
regressions in these wrappers.
   



##########
r/sedonadb/R/join-expression.R:
##########
@@ -20,13 +20,23 @@
 #' Use `sd_join_by()` to specify join conditions for [sd_join()] using
 #' expressions that reference columns from both tables. Table references
 #' are specified using `x$column` and `y$column` syntax to disambiguate
-#' columns from the left and right tables.
+#' columns from the left and right tables, and the special helper `x$geom()`
+#' and `y$geom()` may be used for tables with exactly one geometry column.
+#' Spatial joins can use spatial predicates in the join by expression
+#' (e.g., `sd_join_by(st_intersects(x$geom(), y$geom())`) or the shorthand

Review Comment:
   The final `)` of `sd_join_by(...)` sits outside the inline code span here, 
so the rendered example shows malformed code formatting. Please move the 
closing parenthesis inside the backticks so users can copy the example as 
written.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to