Re: [PR] [SPARK-56395][SQL] Add NEAREST BY top-K ranking join (catalyst-side) [spark]

via GitHub Mon, 04 May 2026 14:29:54 -0700


zhidongqu-db commented on code in PR #55629:
URL: https://github.com/apache/spark/pull/55629#discussion_r3184629440



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:
##########
@@ -657,6 +657,34 @@ trait CheckAnalysis extends LookupCatalog with QueryErrorsBase with PlanToString
                 messageParameters = Map.empty)
             }
 
+          // Reject streaming inputs early. The optimizer rewrite groups by a `__qid` derived
+          // from `MonotonicallyIncreasingID()` and feeds it to a global `Aggregate`, which
+          // Spark turns into a stateful streaming aggregation. Because MID restarts per
+          // micro-batch, `__qid` values collide across batches, and the stateful aggregate
+          // silently merges state from old batches into new rows that share the same key --
+          // producing wrong top-K results. Failing at analysis time is clearer than letting
+          // this slip through. Streaming support is tracked as a follow-up; resolving it does
+          // not require streaming-aware MID and is likely to come from a different grouping
+          // strategy or a dedicated physical operator.
+          case j: NearestByJoin if j.isStreaming =>
+            j.failAnalysis(
+              errorClass = "NEAREST_BY_JOIN.STREAMING_NOT_SUPPORTED",
+              messageParameters = Map.empty)
+
+          case j @ NearestByJoin(_, _, _, _, _, rankingExpression, _)
+              if !RowOrdering.isOrderable(rankingExpression.dataType) =>
+            j.failAnalysis(
+              errorClass = "NEAREST_BY_JOIN.NON_ORDERABLE_RANKING_EXPRESSION",
+              messageParameters = Map(
+                "expression" -> toSQLExpr(rankingExpression),
+                "type" -> toSQLType(rankingExpression.dataType)))
+
+          case j @ NearestByJoin(_, _, _, false, _, rankingExpression, _)
+              if !rankingExpression.deterministic =>
+            j.failAnalysis(

Review Comment:
   I guess this depends on how we define EXACT semantics here. We explicitly
stated in the SPIP that EXACT with a non-deterministic ordering expression
should fail. The intention is for the EXACT keyword to express deterministic
ordering, given a deterministic input and scoring expression. If the scoring
expression is not deterministic in the first place (e.g. LLM-generated scores),
the query fails and the user should use APPROX, where the keyword explicitly
does not imply deterministic results.
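
To make the intended rule concrete, here is a minimal standalone sketch of the decision, not the actual Catalyst code. `SearchMode`, `RankingExpr`, `NearestByCheck`, and the error string are hypothetical stand-ins for illustration only; I am also assuming the `false` in the quoted pattern corresponds to the non-APPROX (EXACT) case.

```scala
// Hypothetical, simplified stand-ins; these are NOT the real Spark classes.
sealed trait SearchMode
case object Exact extends SearchMode
case object Approx extends SearchMode

// Models only the two properties the rule needs: SQL text and determinism.
final case class RankingExpr(sql: String, deterministic: Boolean)

object NearestByCheck {
  // Placeholder error name (assumption): the real error class is not shown in the quoted hunk.
  val NonDeterministicExact = "HYPOTHETICAL_EXACT_REQUIRES_DETERMINISTIC_RANKING"

  // Left(error) when the query should be rejected at analysis time, Right(()) otherwise.
  def check(mode: SearchMode, ranking: RankingExpr): Either[String, Unit] =
    (mode, ranking.deterministic) match {
      // EXACT promises a deterministic ordering, so a non-deterministic score is rejected.
      case (Exact, false) => Left(NonDeterministicExact)
      // APPROX makes no determinism promise, so any orderable score is accepted here.
      case _ => Right(())
    }
}
```

Under this sketch, `NearestByCheck.check(Exact, RankingExpr("rand()", deterministic = false))` is rejected, while the same expression under `Approx` passes.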



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


