Re: [PR] [SPARK-56395][SQL] Add NEAREST BY top-K ranking join (catalyst-side) [spark]

via GitHub Sat, 02 May 2026 16:06:55 -0700


dilipbiswal commented on code in PR #55629:
URL: https://github.com/apache/spark/pull/55629#discussion_r3177345024



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:
##########
@@ -657,6 +657,29 @@ trait CheckAnalysis extends LookupCatalog with 
QueryErrorsBase with PlanToString
                 messageParameters = Map.empty)
             }
 
+          // Reject streaming inputs early. The optimizer rewrite introduces
+          // `MonotonicallyIncreasingID()`, which is per-batch only and would 
silently produce
+          // incorrect results across micro-batches; failing at analysis time 
is clearer than
+          // letting the streaming check fire on an incidental MID node.
+          case j: NearestByJoin if j.isStreaming =>

Review Comment:
   @zhidongqu-db Thanks for the suggestion. I would try to address this in a 
folllow-up.



##########
sql/core/src/test/resources/sql-tests/inputs/join-nearest-by.sql:
##########
@@ -0,0 +1,57 @@
+-- Test cases for NEAREST BY top-K ranking join.
+

Review Comment:
   Have added the tests.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-56395][SQL] Add NEAREST BY top-K ranking join (catalyst-side) [spark]

Reply via email to