Re: [PR] [SPARK-56395][SQL] Add NEAREST BY top-K ranking join (catalyst-side) [spark]

via GitHub Fri, 01 May 2026 10:59:57 -0700


gengliangwang commented on code in PR #55629:
URL: https://github.com/apache/spark/pull/55629#discussion_r3174453758



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteNearestByJoin.scala:
##########
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.plans._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.rules._
+
+/**
+ * Replaces a logical [[NearestByJoin]] operator with a 
`Generate(Inline(...))` over an
+ * `Aggregate` that tags each left row with a unique id, cross-joins with the 
right side, and
+ * groups by the unique id to compute the top-K matches via `MAX_BY`/`MIN_BY` 
(K-overload).
+ *
+ * Input Pseudo-Query:
+ * {{{
+ *    SELECT * FROM left [INNER | LEFT OUTER] JOIN right
+ *      {APPROX | EXACT} NEAREST k BY {DISTANCE | SIMILARITY} expr
+ * }}}
+ *
+ * Rewritten Plan (SIMILARITY, INNER join type):
+ * {{{
+ *    Generate inline(_matches), [N], outer=false, [right.col1, right.col2, 
...]
+ *      +- Aggregate [__qid],
+ *           [first(left.col0) AS left.col0, ..., first(left.colN-1) AS 
left.colN-1,
+ *            max_by(struct(right.*), expr, k) AS _matches]
+ *          +- Join Inner
+ *             :- Project [left.*, monotonically_increasing_id() AS __qid]
+ *             :  +- left
+ *             +- right
+ * }}}
+ *
+ * For `DISTANCE`, `MIN_BY` is used instead of `MAX_BY`. For `LEFT OUTER`, the 
`Generate` is
+ * constructed with `outer = true` so left rows with no matches (empty/null 
`_matches`) are
+ * preserved with `NULL` right-side columns.
+ *
+ * In this initial implementation both `APPROX` and `EXACT` take the same 
brute-force rewrite

Review Comment:
   **`APPROX` is currently dead weight.** It's parsed, stored on the node, and 
(only) gates the deterministic-ranking check, but the rewrite path is identical 
for both modes — there's no `Strategy`/rule that picks a different physical 
plan, and no test that exercises the user-visible difference between the two 
modes other than the determinism guard.
   
   Would you consider either: (1) dropping `approx` from the node and 
reintroducing it when the first ANN strategy lands, or (2) referencing a 
follow-up subtask here so reviewers know what behavior to expect from this 
field? As-is, `APPROX` is a public surface (grammar, `EXPLAIN` output) that 
does nothing.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteNearestByJoin.scala:
##########
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.plans._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.rules._
+
+/**
+ * Replaces a logical [[NearestByJoin]] operator with a 
`Generate(Inline(...))` over an
+ * `Aggregate` that tags each left row with a unique id, cross-joins with the 
right side, and
+ * groups by the unique id to compute the top-K matches via `MAX_BY`/`MIN_BY` 
(K-overload).
+ *
+ * Input Pseudo-Query:
+ * {{{
+ *    SELECT * FROM left [INNER | LEFT OUTER] JOIN right
+ *      {APPROX | EXACT} NEAREST k BY {DISTANCE | SIMILARITY} expr
+ * }}}
+ *
+ * Rewritten Plan (SIMILARITY, INNER join type):
+ * {{{
+ *    Generate inline(_matches), [N], outer=false, [right.col1, right.col2, 
...]
+ *      +- Aggregate [__qid],
+ *           [first(left.col0) AS left.col0, ..., first(left.colN-1) AS 
left.colN-1,
+ *            max_by(struct(right.*), expr, k) AS _matches]
+ *          +- Join Inner
+ *             :- Project [left.*, monotonically_increasing_id() AS __qid]
+ *             :  +- left
+ *             +- right
+ * }}}
+ *
+ * For `DISTANCE`, `MIN_BY` is used instead of `MAX_BY`. For `LEFT OUTER`, the 
`Generate` is
+ * constructed with `outer = true` so left rows with no matches (empty/null 
`_matches`) are
+ * preserved with `NULL` right-side columns.
+ *
+ * In this initial implementation both `APPROX` and `EXACT` take the same 
brute-force rewrite
+ * path. `APPROX` establishes the contract for future indexed-ANN strategies.
+ */
+object RewriteNearestByJoin extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan.transformUpWithNewOutput {
+    case j @ NearestByJoin(left, right, joinType, _, numResults, 
rankingExpression, direction) =>
+      // 1. Tag each left row with a unique id so that rows from the same left 
row can later be
+      //    grouped together after the cross-join with `right`.
+      val qidAlias = Alias(MonotonicallyIncreasingID(), "__qid")()
+      val taggedLeft = Project(left.output :+ qidAlias, left)
+      val qidAttr = qidAlias.toAttribute
+
+      // 2. LEFT OUTER-join the tagged left with right (no join condition). 
LEFT OUTER
+      //    (rather than INNER) preserves left rows even when `right` is 
empty, so that a
+      //    `LEFT OUTER NEAREST BY` query still returns those rows with `NULL` 
right-side
+      //    columns after the aggregate + inline below. When `right` is 
non-empty every left
+      //    row already has right-row pairings, so LEFT OUTER and INNER are 
equivalent.
+      //
+      //    Tag the join so `CheckCartesianProducts` skips it: the rewrite 
intentionally
+      //    materializes a cross product bounded by the downstream `MaxMinByK` 
aggregate, so
+      //    `spark.sql.crossJoin.enabled = false` should not reject user 
queries written as
+      //    `NEAREST BY`.
+      val join = Join(taggedLeft, right, LeftOuter, None, JoinHint.NONE)
+      join.setTagValue(NearestByJoin.SYNTHETIC_JOIN_TAG, ())
+
+      // 3. Aggregate grouped by `__qid`:
+      //      - first(col) for every left column so it flows to the output.
+      //      - max_by/min_by(struct(right.*), ranking, k) as `_matches`.
+      //    The ranking expression references left and right columns directly; 
no outer
+      //    reference is needed because both sides are present in the joined 
input.
+      val rightStruct = CreateStruct(right.output)
+      // reverse = true  -> MIN_BY (smallest ranking value first, for DISTANCE)
+      // reverse = false -> MAX_BY (largest ranking value first, for 
SIMILARITY)
+      val reverse = direction match {
+        case NearestByDistance => true
+        case NearestBySimilarity => false
+      }
+      val topK = MaxMinByK(
+        rightStruct,
+        rankingExpression,
+        Literal(numResults),
+        reverse = reverse).toAggregateExpression()
+      val matchesAlias = Alias(topK, "__nearest_matches__")()
+
+      // Carry left columns through with `First`. Within a `__qid` group every 
row has the same
+      // left values (each group corresponds to one left row), so `First` is 
effectively a no-op.
+      // We use `First` rather than adding all left columns to the GROUP BY 
because grouping by
+      // `__qid` alone keeps the shuffle key small.
+      val firstLeftAggs = left.output.map { attr =>
+        Alias(
+          First(attr, ignoreNulls = false).toAggregateExpression(),
+          attr.name)(exprId = attr.exprId, qualifier = attr.qualifier)
+      }
+      val aggregate = Aggregate(Seq(qidAttr), firstLeftAggs :+ matchesAlias, 
join)
+
+      // 4. Generate inline(_matches) expands the K-element array into K rows, 
exposing each
+      //    struct field as a top-level column. `outer = true` for LEFT OUTER 
preserves the
+      //    left row with NULL right columns when there are no matches.
+      val generatorOutput = right.output.map { a =>
+        AttributeReference(a.name, a.dataType, nullable = true)(qualifier = 
a.qualifier)
+      }

Review Comment:
   **`generatorOutput` drops right-side metadata.** If the right relation has 
columns with `Metadata` (comments, generated-column markers, column-level 
config), it's silently lost from the rewritten plan's output schema. Forward 
`a.metadata` explicitly:
   
   ```suggestion
         val generatorOutput = right.output.map { a =>
           AttributeReference(a.name, a.dataType, nullable = true, 
a.metadata)(qualifier = a.qualifier)
         }
   ```



##########
common/utils/src/main/resources/error/error-conditions.json:
##########
@@ -5313,6 +5313,49 @@
     ],
     "sqlState" : "0A000"
   },
+  "NEAREST_BY_JOIN" : {
+    "message" : [
+      "Invalid nearest-by join."
+    ],
+    "subClass" : {
+      "EXACT_WITH_NONDETERMINISTIC_EXPRESSION" : {
+        "message" : [
+          "EXACT nearest-by join is incompatible with the nondeterministic 
ranking expression <expression>. Use APPROX, or replace the expression with a 
deterministic one."
+        ]
+      },
+      "NON_ORDERABLE_RANKING_EXPRESSION" : {
+        "message" : [
+          "The ranking expression <expression> of type <type> is not 
orderable."
+        ]
+      },
+      "NUM_RESULTS_OUT_OF_RANGE" : {
+        "message" : [
+          "The number of results <numResults> must be between <min> and <max>."
+        ]
+      },
+      "STREAMING_NOT_SUPPORTED" : {
+        "message" : [
+          "NEAREST BY join is not supported with streaming 
DataFrames/Datasets."

Review Comment:
   Capitalization is inconsistent with the other six `NEAREST_BY_JOIN.*` 
sub-messages, which all use `nearest-by join`.
   
   ```suggestion
             "nearest-by join is not supported with streaming 
DataFrames/Datasets."
   ```



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:
##########
@@ -2420,3 +2420,62 @@ object AsOfJoin {
     }
   }
 }
+
+object NearestByJoin {
+  /** Upper bound on `numResults`. Mirrors the K-overload limit of 
`MaxMinByK`. */
+  val MaxNumResults: Int = 100000
+
+  /**
+   * Tag set by `RewriteNearestByJoin` on the synthetic `Join` it produces. 
The synthetic join
+   * has no condition by construction (it is the cross-join step in the 
rewrite, bounded by the
+   * subsequent `MaxMinByK` aggregate). `CheckCartesianProducts` skips any 
join carrying this
+   * tag so that user queries written as `NEAREST BY` are not rejected when
+   * `spark.sql.crossJoin.enabled` is set to `false`.
+   */
+  val SYNTHETIC_JOIN_TAG: TreeNodeTag[Unit] = 
TreeNodeTag("nearestBySyntheticJoin")
+}
+
+/**

Review Comment:
   Minor grammar nit:
   
   ```suggestion
   /**
    * A logical plan for a nearest-by top-K ranking join. For each row on the 
left side it returns up to
   ```



##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteNearestByJoinSuite.scala:
##########
@@ -0,0 +1,115 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.dsl.plans._
+import org.apache.spark.sql.catalyst.expressions.{Alias, AttributeReference, 
CreateStruct, Inline, Literal, MonotonicallyIncreasingID}
+import org.apache.spark.sql.catalyst.expressions.aggregate.{First, MaxMinByK}
+import org.apache.spark.sql.catalyst.plans.{Inner, LeftOuter, 
NearestByDistance, NearestBySimilarity, PlanTest}
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Generate, Join, 
JoinHint, LocalRelation, NearestByJoin, Project}
+
+class RewriteNearestByJoinSuite extends PlanTest {
+
+  private def expectedRewrite(
+      left: LocalRelation,
+      right: LocalRelation,
+      numResults: Int,
+      ranking: org.apache.spark.sql.catalyst.expressions.Expression,
+      reverse: Boolean,
+      outer: Boolean) = {
+    val qidAlias = Alias(MonotonicallyIncreasingID(), "__qid")()
+    val taggedLeft = Project(left.output :+ qidAlias, left)
+    val join = Join(taggedLeft, right, LeftOuter, None, JoinHint.NONE)
+
+    val rightStruct = CreateStruct(right.output)
+    val topKAgg = MaxMinByK(
+      rightStruct, ranking, Literal(numResults), reverse = reverse)
+      .toAggregateExpression()
+    val matchesAlias = Alias(topKAgg, "__nearest_matches__")()
+    val firstLeftAggs = left.output.map { attr =>
+      Alias(
+        First(attr, ignoreNulls = false).toAggregateExpression(),
+        attr.name)(exprId = attr.exprId, qualifier = attr.qualifier)
+    }
+    val aggregate = Aggregate(
+      Seq(qidAlias.toAttribute), firstLeftAggs :+ matchesAlias, join)
+
+    val generatorOutput = right.output.map { a =>
+      AttributeReference(a.name, a.dataType, nullable = true)(qualifier = 
a.qualifier)
+    }
+    Generate(
+      Inline(matchesAlias.toAttribute),
+      unrequiredChildIndex = 
Seq(aggregate.output.indexOf(matchesAlias.toAttribute)),
+      outer = outer,
+      qualifier = None,
+      generatorOutput = generatorOutput,
+      child = aggregate)
+  }
+
+  test("similarity, inner, k=5") {
+    val left = LocalRelation($"a".int, $"b".int)
+    val right = LocalRelation($"x".int, $"y".int)
+    val query = NearestByJoin(
+      left, right, Inner, approx = true, numResults = 5,
+      rankingExpression = left.output(0) + right.output(0),
+      direction = NearestBySimilarity)
+
+    val rewritten = RewriteNearestByJoin(query.analyze)
+    val expected = expectedRewrite(
+      left, right, 5,
+      ranking = left.output(0) + right.output(0),
+      reverse = false, outer = false)
+
+    comparePlans(rewritten, expected, checkAnalysis = false)
+  }

Review Comment:
   **Structural-only test suite.** `expectedRewrite` is a 1:1 mirror of the 
production rule, so `comparePlans(rewritten, expected)` only catches accidental 
edits to the production code — any logic bug in the rewrite that's also copied 
here passes. Worth adding:
   
   1. An end-to-end test that runs a small query through the planner and checks 
**result rows** (correct K matches, correct order, ties, empty right side for 
`LEFT OUTER`).
   2. A test with `spark.sql.crossJoin.enabled = false` to lock in the headline 
behavior of the synthetic-join tag.
   3. A test for `APPROX + nondeterministic ranking` (the only user-visible 
behavioral difference between APPROX and EXACT today).



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteNearestByJoin.scala:
##########
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.plans._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.rules._
+
+/**
+ * Replaces a logical [[NearestByJoin]] operator with a 
`Generate(Inline(...))` over an
+ * `Aggregate` that tags each left row with a unique id, cross-joins with the 
right side, and
+ * groups by the unique id to compute the top-K matches via `MAX_BY`/`MIN_BY` 
(K-overload).
+ *
+ * Input Pseudo-Query:
+ * {{{
+ *    SELECT * FROM left [INNER | LEFT OUTER] JOIN right
+ *      {APPROX | EXACT} NEAREST k BY {DISTANCE | SIMILARITY} expr
+ * }}}
+ *
+ * Rewritten Plan (SIMILARITY, INNER join type):
+ * {{{
+ *    Generate inline(_matches), [N], outer=false, [right.col1, right.col2, 
...]
+ *      +- Aggregate [__qid],
+ *           [first(left.col0) AS left.col0, ..., first(left.colN-1) AS 
left.colN-1,
+ *            max_by(struct(right.*), expr, k) AS _matches]
+ *          +- Join Inner
+ *             :- Project [left.*, monotonically_increasing_id() AS __qid]
+ *             :  +- left
+ *             +- right
+ * }}}
+ *
+ * For `DISTANCE`, `MIN_BY` is used instead of `MAX_BY`. For `LEFT OUTER`, the 
`Generate` is
+ * constructed with `outer = true` so left rows with no matches (empty/null 
`_matches`) are
+ * preserved with `NULL` right-side columns.
+ *
+ * In this initial implementation both `APPROX` and `EXACT` take the same 
brute-force rewrite
+ * path. `APPROX` establishes the contract for future indexed-ANN strategies.
+ */
+object RewriteNearestByJoin extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan.transformUpWithNewOutput {
+    case j @ NearestByJoin(left, right, joinType, _, numResults, 
rankingExpression, direction) =>
+      // 1. Tag each left row with a unique id so that rows from the same left 
row can later be
+      //    grouped together after the cross-join with `right`.
+      val qidAlias = Alias(MonotonicallyIncreasingID(), "__qid")()
+      val taggedLeft = Project(left.output :+ qidAlias, left)
+      val qidAttr = qidAlias.toAttribute
+
+      // 2. LEFT OUTER-join the tagged left with right (no join condition). 
LEFT OUTER
+      //    (rather than INNER) preserves left rows even when `right` is 
empty, so that a
+      //    `LEFT OUTER NEAREST BY` query still returns those rows with `NULL` 
right-side
+      //    columns after the aggregate + inline below. When `right` is 
non-empty every left
+      //    row already has right-row pairings, so LEFT OUTER and INNER are 
equivalent.
+      //
+      //    Tag the join so `CheckCartesianProducts` skips it: the rewrite 
intentionally
+      //    materializes a cross product bounded by the downstream `MaxMinByK` 
aggregate, so
+      //    `spark.sql.crossJoin.enabled = false` should not reject user 
queries written as
+      //    `NEAREST BY`.
+      val join = Join(taggedLeft, right, LeftOuter, None, JoinHint.NONE)
+      join.setTagValue(NearestByJoin.SYNTHETIC_JOIN_TAG, ())

Review Comment:
   **Tag fragility across optimizer batches.** This rule runs in `Finish 
Analysis`; `CheckCartesianProducts` runs many batches later. Several 
intervening rules (`PushPredicateThroughJoin`, `EliminateOuterJoin`, 
`ReorderJoin`, `PushDownLeftSemiAntiJoin`, ...) construct new `Join` nodes via 
the case-class constructor and do not call `copyTagsFrom`. If any of them fires 
on this synthetic join, the replacement `Join` is born tagless and 
`CheckCartesianProducts` will throw on a user-written `NEAREST BY` query when 
`spark.sql.crossJoin.enabled = false`.
   
   Options: (a) run this rule *after* `CheckCartesianProducts`; (b) detect the 
rewrite shape structurally in `CheckCartesianProducts` (e.g., parent is the 
`MaxMinByK` aggregate) rather than via a tag; (c) add tests that exercise the 
synthetic join *after* predicate pushdown / outer-join elimination with 
`crossJoin.enabled=false` to demonstrate no intervening rule strips the tag. 
The current suite tests neither the cross-join config nor any post-rewrite 
optimization.



##########
sql/api/src/main/scala/org/apache/spark/sql/errors/QueryParsingErrors.scala:
##########
@@ -203,6 +203,31 @@ private[sql] object QueryParsingErrors extends 
DataTypeErrorsBase {
       ctx)
   }
 
+  def nearestByJoinWithLateralUnsupportedError(ctx: ParserRuleContext): 
Throwable = {
+    new ParseException(
+      errorClass = "UNSUPPORTED_FEATURE.LATERAL_JOIN_NEAREST_BY",
+      messageParameters = Map.empty,
+      ctx)
+  }
+
+  def unsupportedNearestByJoinTypeError(ctx: ParserRuleContext, joinType: 
String): Throwable = {
+    new ParseException(
+      errorClass = "NEAREST_BY_JOIN.UNSUPPORTED_JOIN_TYPE",
+      messageParameters =
+        Map("joinType" -> toSQLStmt(joinType), "supported" -> "'INNER', 'LEFT 
OUTER'"),
+      ctx)

Review Comment:
   Use `NearestByJoinType.supportedDisplay` instead of hardcoding the same 
string in two places — that constant exists precisely so the SQL and DataFrame 
paths stay in sync.
   
   ```suggestion
           Map("joinType" -> toSQLStmt(joinType), "supported" -> 
NearestByJoinType.supportedDisplay),
   ```
   (Will need a `NearestByJoinType` import.)



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteNearestByJoin.scala:
##########
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.plans._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.rules._
+
+/**
+ * Replaces a logical [[NearestByJoin]] operator with a 
`Generate(Inline(...))` over an
+ * `Aggregate` that tags each left row with a unique id, cross-joins with the 
right side, and
+ * groups by the unique id to compute the top-K matches via `MAX_BY`/`MIN_BY` 
(K-overload).
+ *
+ * Input Pseudo-Query:
+ * {{{
+ *    SELECT * FROM left [INNER | LEFT OUTER] JOIN right
+ *      {APPROX | EXACT} NEAREST k BY {DISTANCE | SIMILARITY} expr
+ * }}}
+ *
+ * Rewritten Plan (SIMILARITY, INNER join type):
+ * {{{
+ *    Generate inline(_matches), [N], outer=false, [right.col1, right.col2, 
...]
+ *      +- Aggregate [__qid],
+ *           [first(left.col0) AS left.col0, ..., first(left.colN-1) AS 
left.colN-1,
+ *            max_by(struct(right.*), expr, k) AS _matches]
+ *          +- Join Inner
+ *             :- Project [left.*, monotonically_increasing_id() AS __qid]
+ *             :  +- left
+ *             +- right
+ * }}}
+ *
+ * For `DISTANCE`, `MIN_BY` is used instead of `MAX_BY`. For `LEFT OUTER`, the 
`Generate` is
+ * constructed with `outer = true` so left rows with no matches (empty/null 
`_matches`) are
+ * preserved with `NULL` right-side columns.
+ *
+ * In this initial implementation both `APPROX` and `EXACT` take the same 
brute-force rewrite
+ * path. `APPROX` establishes the contract for future indexed-ANN strategies.
+ */
+object RewriteNearestByJoin extends Rule[LogicalPlan] {

Review Comment:
   Worth adding a sentence on why this rewrite materializes the cross product 
instead of using a correlated scalar subquery like `RewriteAsOfJoin`. The short 
version: a scalar subquery returns one row, so it can't carry K matches without 
an additional `Generate`. Spelling it out helps future readers evaluate whether 
the design should ever change.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-56395][SQL] Add NEAREST BY top-K ranking join (catalyst-side) [spark]

Reply via email to