Re: [PR] [SPARK-56395][SQL] Add NEAREST BY top-K ranking join (catalyst-side) [spark]

via GitHub Mon, 04 May 2026 13:11:59 -0700


sigmod commented on code in PR #55629:
URL: https://github.com/apache/spark/pull/55629#discussion_r3184211391



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:
##########
@@ -657,6 +657,34 @@ trait CheckAnalysis extends LookupCatalog with QueryErrorsBase with PlanToString
                 messageParameters = Map.empty)
             }
 
+          // Reject streaming inputs early. The optimizer rewrite groups by a `__qid` derived
+          // from `MonotonicallyIncreasingID()` and feeds it to a global `Aggregate`, which
+          // Spark turns into a stateful streaming aggregation. Because MID restarts per
+          // micro-batch, `__qid` values collide across batches, and the stateful aggregate
+          // silently merges state from old batches into new rows that share the same key --
+          // producing wrong top-K results. Failing at analysis time is clearer than letting
+          // this slip through. Streaming support is tracked as a follow-up; resolving it does
+          // not require streaming-aware MID and is likely to come from a different grouping
+          // strategy or a dedicated physical operator.
+          case j: NearestByJoin if j.isStreaming =>
+            j.failAnalysis(
+              errorClass = "NEAREST_BY_JOIN.STREAMING_NOT_SUPPORTED",
+              messageParameters = Map.empty)
+
+          case j @ NearestByJoin(_, _, _, _, _, rankingExpression, _)
+              if !RowOrdering.isOrderable(rankingExpression.dataType) =>
+            j.failAnalysis(
+              errorClass = "NEAREST_BY_JOIN.NON_ORDERABLE_RANKING_EXPRESSION",
+              messageParameters = Map(
+                "expression" -> toSQLExpr(rankingExpression),
+                "type" -> toSQLType(rankingExpression.dataType)))
+
+          case j @ NearestByJoin(_, _, _, false, _, rankingExpression, _)
+              if !rankingExpression.deterministic =>
+            j.failAnalysis(

Review Comment:
   Do we have to fail this case?
   Wouldn't we still call the result of the following query "exact" rather than "approximate"?
   
   > SELECT any_value(t.v) 
   > FROM t
   
   I view the distinction as:
   - exact results: can be deterministic or non-deterministic, but have well-defined semantics w.r.t. input/output.
   - approx results: have no well-defined semantics w.r.t. input/output.
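
To make the distinction above concrete, here is a minimal sketch in plain Scala (no Spark dependency). `anyValue` and `satisfiesContract` are hypothetical names invented for this illustration, not Spark APIs, and the modelling of `any_value` as "return some element of the input" is an assumption made for the sake of the example. It shows a function whose output can vary between calls while still obeying a precise input/output contract, which is the sense of "exact but non-deterministic" used in the comment.

```scala
import scala.util.Random

object AnyValueSketch {
  // Hypothetical stand-in for an any_value-style aggregate: returns some
  // element of the input. Which element is unspecified, so the function is
  // non-deterministic across different random sources.
  def anyValue[A](rows: Seq[A], rng: Random): Option[A] =
    if (rows.isEmpty) None else Some(rows(rng.nextInt(rows.length)))

  // The well-defined contract: a non-empty input yields a member of the
  // input, and an empty input yields no value.
  def satisfiesContract[A](rows: Seq[A], result: Option[A]): Boolean =
    result match {
      case Some(v) => rows.contains(v)
      case None    => rows.isEmpty
    }

  def main(args: Array[String]): Unit = {
    val t = Seq(10, 20, 30)
    // Different seeds may pick different elements of t.
    val results = (1 to 5).map(seed => anyValue(t, new Random(seed)))
    // Every result still satisfies the contract, whichever element was picked.
    println(results.forall(r => satisfiesContract(t, r)))
  }
}
```

Under this reading, an approximate result would be one for which no such contract check can be written purely in terms of the input and the output.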



