Re: [PR] [SPARK-56395][SQL] Add NEAREST BY top-K ranking join (catalyst-side) [spark]

via GitHub Tue, 05 May 2026 09:04:55 -0700


dilipbiswal commented on code in PR #55629:
URL: https://github.com/apache/spark/pull/55629#discussion_r3189883296



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:
##########
@@ -657,6 +657,34 @@ trait CheckAnalysis extends LookupCatalog with QueryErrorsBase with PlanToString
                 messageParameters = Map.empty)
             }
 
+          // Reject streaming inputs early. The optimizer rewrite is built around an
+          // unconditioned cross-product fed into a global `Aggregate` keyed by a per-row
+          // identifier (`__qid`). That shape doesn't compose cleanly with structured-streaming
+          // semantics: a stateful aggregate keyed by a freshly-generated identifier accumulates
+          // state indefinitely (every batch creates new keys, old keys never match again) and a
+          // cross-product against a streaming right side has no bounded state model today.
+          // Failing at analysis time is clearer than letting either fail at runtime. Streaming
+          // support is tracked as a follow-up; resolving it likely comes from a different
+          // grouping strategy or a dedicated physical operator.
+          case j: NearestByJoin if j.isStreaming =>
+            j.failAnalysis(
+              errorClass = "NEAREST_BY_JOIN.STREAMING_NOT_SUPPORTED",
+              messageParameters = Map.empty)

Review Comment:
   @gengliangwang Thanks! Will follow up on this.
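
For readers following the thread, the state-growth concern described in the code comment above can be sketched with a small simulation. This is plain Scala, not Spark code and not part of the PR; `StateStore`, `run`, and `keyOf` are hypothetical names invented for the illustration. It assumes a stateful aggregate that keeps one state entry per grouping key and never evicts entries, and compares a fresh per-row key (in the spirit of `__qid`) with a bounded key domain.

```scala
// Illustrative simulation only; this is not Spark code and not part of the PR.
import scala.collection.mutable

object FreshKeyStateGrowth {
  // Hypothetical stand-in for a stateful aggregate's state: key -> running count.
  // Assumption: entries are never evicted (no watermark or timeout).
  final class StateStore {
    private val state = mutable.Map.empty[Long, Long]
    def update(key: Long): Unit = state.update(key, state.getOrElse(key, 0L) + 1L)
    def size: Int = state.size
  }

  // Feeds `batches` micro-batches of `rowsPerBatch` rows through one store and
  // returns the number of state entries after each batch.
  // `keyOf` maps a global row index to its grouping key.
  def run(batches: Int, rowsPerBatch: Int)(keyOf: Long => Long): Seq[Int] = {
    val store = new StateStore
    var nextRow = 0L
    (1 to batches).map { _ =>
      (0 until rowsPerBatch).foreach { _ =>
        store.update(keyOf(nextRow))
        nextRow += 1
      }
      store.size
    }
  }

  def main(args: Array[String]): Unit = {
    // Fresh identifier per row: every row introduces a key that never recurs.
    val fresh = run(batches = 5, rowsPerBatch = 100)(row => row)
    // Bounded key domain for contrast: keys repeat, so the state size is capped.
    val bounded = run(batches = 5, rowsPerBatch = 100)(row => row % 10)
    println(s"fresh keys:   $fresh")
    println(s"bounded keys: $bounded")
  }
}
```

With fresh keys the entry count grows by one per input row in every batch, while the bounded case levels off at the size of the key domain. That difference is the reason the comment gives for rejecting streaming inputs at analysis time.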



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

