[PR] [SPARK-XXXXX][SQL] Add streaming heap operator for NearestByJoin [spark]

via GitHub Mon, 25 May 2026 10:43:14 -0700


yadavay-amzn opened a new pull request, #56101:
URL: https://github.com/apache/spark/pull/56101


   ### What changes were proposed in this pull request?
   
   Add `StreamingNearestByJoinExec`, a physical operator for NearestByJoin that 
avoids materializing the full N×M cross product. Instead of rewriting the join 
to a cross join followed by an aggregate, the operator broadcasts the right side 
and, for each left row, scans the broadcast rows while keeping only the k 
nearest candidates in a bounded priority queue.
   
   ### Why are the changes needed?
   
   The current `RewriteNearestByJoin` implementation materializes all N×M 
candidate pairs, shuffles them, and aggregates. At 30K×30K scale this takes 
400s and uses 1.7GB. The streaming heap completes in 31s using 208MB, which is 
**13x faster with 8.3x less memory**.
   
   | Scale (N×M) | Current (cross-product) time | Streaming heap time | Speedup | Memory use |
   |-------------|------------------------------|---------------------|---------|------------|
   | 10K×10K | 4.2s | 0.38s | 11x | 7x less |
   | 30K×30K | 13.1s | 1.01s | 13x | 8.3x less |
   | 50K×50K | 48.7s | 4.06s | 12x | ~8x less |
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. The feature is opt-in via 
`spark.sql.join.nearestBy.streamingHeap.enabled` (default false).
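
A minimal sketch of opting in from PySpark, assuming the flag is exposed as a runtime SQL conf. The key is copied from this description and does not exist in released Spark; `enable_streaming_heap` is a hypothetical helper, and `spark` is expected to be an existing `SparkSession`.

```python
# Config key as given in this PR description (not part of released Spark).
STREAMING_HEAP_KEY = "spark.sql.join.nearestBy.streamingHeap.enabled"


def enable_streaming_heap(spark) -> None:
    """Opt the given SparkSession in to the streaming heap operator."""
    # spark.conf.set is the standard PySpark setter for runtime configs.
    spark.conf.set(STREAMING_HEAP_KEY, "true")
```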
   
   ### How was this patch tested?
   
   - Correctness test verifying identical results to the rewrite (20×15 
dataset, k=3)
   - Memory benchmark at 30K×30K showing 13x speedup and 8.3x memory reduction
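
The shape of the correctness check can be sketched in plain Python. This is an illustration only, not the test in this PR: the scalar absolute-difference distance, the random data, and both function names are assumptions, and `heapq.nsmallest` stands in for the bounded queue. Only the 20×15 size and k=3 come from the description above.

```python
import heapq
import random


def cross_product_top_k(left, right, k):
    """Reference path: build all N×M pairs, then keep the k closest per left row."""
    pairs = [
        (i, abs(l_val - r_val), j)
        for i, l_val in enumerate(left)
        for j, r_val in enumerate(right)
    ]
    result = {}
    for i in range(len(left)):
        ranked = sorted((d, j) for row, d, j in pairs if row == i)
        result[i] = [j for _, j in ranked[:k]]
    return result


def heap_top_k(left, right, k):
    """Streaming path: one pass over the right side per left row, keeping k candidates."""
    return {
        i: [
            j
            for _, j in heapq.nsmallest(
                k, ((abs(l_val - r_val), j) for j, r_val in enumerate(right))
            )
        ]
        for i, l_val in enumerate(left)
    }


rng = random.Random(0)  # fixed seed so the comparison is reproducible
left = [rng.uniform(0, 100) for _ in range(20)]
right = [rng.uniform(0, 100) for _ in range(15)]
assert cross_product_top_k(left, right, 3) == heap_top_k(left, right, 3)
```

Both paths rank candidates by `(distance, right index)`, so ties resolve the same way and the comparison can be exact.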
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Yes.
   
   ---
   
   **Note:** This is a draft/prototype for discussion. Design doc: 
https://quip-amazon.com/IeZPAZPA9PF4



