+1

On Tue, Apr 28, 2026 at 1:04 PM Zero Qu via dev <[email protected]>
wrote:

> Hi all,
>
> I'd like to call a vote on the *SPIP: NEAREST BY Top-K Ranking Join*.
>
> *Motivation*
> Top-K nearest neighbor search is a fundamental building block for semantic
> search, retrieval-augmented generation (RAG), recommendation systems, and
> geospatial nearest-neighbor queries. Today, Spark SQL users have to express
> this pattern through verbose CROSS JOIN + window function or max_by/min_by
> workarounds - patterns that materialize the full Cartesian product and give
> the optimizer no semantic signal for specialized execution strategies.
> Competing systems (BigQuery, SQL Server 2025, Snowflake, PostgreSQL with
> pgvector) all provide dedicated primitives for this. Spark currently does
> not.
>
> *Proposal*
> This SPIP proposes extending standard SQL JOIN syntax with a NEAREST ...
> BY clause for top-K ranking joins. The BY expression is pluggable - vector
> similarity, geometric distance, BM25, or any composite scoring expression -
> making the same syntax usable across vector search, geospatial, and text
> retrieval use cases. The APPROX / EXACT keywords make the search algorithm
> contract explicit, ensuring future index creation or deletion never
> silently changes query results.
>
> The initial scope covers SQL syntax, brute-force exact execution
> (rewritten into existing physical operators: JOIN + max_by/min_by with K
> overload), and Spark Connect / PySpark API support. Vector index DDL and
> indexed ANN execution are deferred as future work.
>
> *Example SQL*:
>
> -- Batch vector search: find the 10 most similar products for each user
> SELECT q.user_id, t.*
> FROM users q
> INNER JOIN products t
>   APPROX NEAREST 10 BY SIMILARITY
>     vector_cosine_similarity(q.embedding, t.embedding)
>
> *Relevant Links*
>
> SPIP Document:
> https://docs.google.com/document/d/1opFVcQJgEWDWUVB7uVlFMlNomRwxqRu8iW0JmvCvxF0/edit?tab=t.0#heading=h.hf633coi8nc7
> Discussion Thread:
> https://lists.apache.org/thread/zg8nk236g9f4lg6d2tm6s3xh0cfhg4hm
> JIRA: https://issues.apache.org/jira/browse/SPARK-56395
>
> The vote will be open for at least 72 hours.
> Please vote:
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don't think this is a good idea because ...
>
> Cheers,
>
> Zhidong (Zero) Qu
> Software Engineer
> AI System
>
>
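
For readers skimming the thread, the brute-force exact semantics in the quoted proposal (score every candidate pair, keep the K best per left-side row) can be sketched in plain Python. This is only an illustration under stated assumptions, not the SPIP's implementation or any Spark API: the helpers `cosine_similarity` and `nearest_by_similarity`, the `product_id` field, and the sample rows are all hypothetical, and tie-breaking among equal scores is left to `heapq.nlargest` rather than to anything the SPIP specifies.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_by_similarity(left, right, k, score):
    """Hypothetical brute-force top-K ranking join (inner-join flavor).

    Every left row is scored against every right row, and only the k
    highest-scoring right rows are kept for that left row.
    """
    pairs = []
    for q in left:
        # nlargest returns at most k rows, ordered from highest to lowest score.
        top = heapq.nlargest(k, right, key=lambda t: score(q, t))
        pairs.extend((q, t) for t in top)
    return pairs


# Made-up sample data, loosely mirroring the users/products example in the SPIP email.
users = [
    {"user_id": 1, "embedding": [1.0, 0.0]},
    {"user_id": 2, "embedding": [0.0, 1.0]},
]
products = [
    {"product_id": "a", "embedding": [0.9, 0.1]},
    {"product_id": "b", "embedding": [0.1, 0.9]},
    {"product_id": "c", "embedding": [0.7, 0.7]},
]

pairs = nearest_by_similarity(
    users,
    products,
    k=2,
    score=lambda q, t: cosine_similarity(q["embedding"], t["embedding"]),
)
result = [(q["user_id"], t["product_id"]) for q, t in pairs]
print(result)
```

The sketch does work proportional to the number of left rows times the number of right rows, which is the cost the proposal accepts for its initial exact path; as the quoted email notes, indexed ANN execution is deferred as future work.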
