yadavay-amzn opened a new pull request, #56101: URL: https://github.com/apache/spark/pull/56101
### What changes were proposed in this pull request? Add `StreamingNearestByJoinExec`, a physical operator for NearestByJoin that avoids materializing the full N×M cross product. Instead of rewriting to cross-join + aggregate, the operator broadcasts the right side and iterates per left row with a bounded priority queue of size k. ### Why are the changes needed? The current `RewriteNearestByJoin` implementation materializes all N×M candidate pairs, shuffles them, and aggregates. At 30K×30K scale this takes 400s and uses 1.7GB. The streaming heap completes in 31s using 208MB — **13x faster and 8.3x less memory**. | Scale | Current (cross-product) | Streaming Heap | Speedup | Memory | |-------|------------------------|----------------|---------|--------| | 10K×10K | 4.2s | 0.38s | 11x | 7x less | | 30K×30K | 13.1s | 1.01s | 13x | 8.3x less | | 50K×50K | 48.7s | 4.06s | 12x | ~8x less | ### Does this PR introduce _any_ user-facing change? No. The feature is opt-in via `spark.sql.join.nearestBy.streamingHeap.enabled` (default false). ### How was this patch tested? - Correctness test verifying identical results to the rewrite (20×15 dataset, k=3) - Memory benchmark at 30K×30K showing 13x speedup and 8.3x memory reduction ### Was this patch authored or co-authored using generative AI tooling? Yes. --- **Note:** This is a draft/prototype for discussion. Design doc: https://quip-amazon.com/IeZPAZPA9PF4 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
