+1 (non-binding)

On Fri, May 1, 2026, 21:29, Yingyi Bu <[email protected]> wrote:

> +1 (non-binding)
>
> Best,
> Yingyi
>
> On Fri, May 1, 2026 at 11:33 AM Anish Shrigondekar via dev <
> [email protected]> wrote:
>
>> +1 (non-binding)
>>
>> Would also be interesting to see how we could add streaming support for
>> this operator in the future as well
>>
>> Thanks,
>> Anish
>>
>> On Fri, May 1, 2026 at 10:42 AM Menelaos Karavelas <
>> [email protected]> wrote:
>>
>>> +1 (non-binding)
>>>
>>>
>>> On May 1, 2026, at 10:31 AM, Gengliang Wang <[email protected]> wrote:
>>>
>>> +1
>>>
>>> On Wed, Apr 29, 2026 at 8:20 AM Peter Toth <[email protected]> wrote:
>>>
>>>> +1 (non-binding)
>>>>
>>>> On Wed, Apr 29, 2026 at 4:33 PM Antônio Marcos Souza Pereira <
>>>> [email protected]> wrote:
>>>>
>>>>> +1
>>>>>
>>>>>
>>>>> On Tue, Apr 28, 2026 at 9:03 PM huaxin gao <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> On Tue, Apr 28, 2026 at 4:49 PM Wenchen Fan <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> On Tue, Apr 28, 2026 at 1:04 PM Zero Qu via dev <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I'd like to call a vote on the *SPIP: NEAREST BY Top-K Ranking
>>>>>>>> Join*.
>>>>>>>>
>>>>>>>> *Motivation*
>>>>>>>> Top-K nearest neighbor search is a fundamental building block for
>>>>>>>> semantic search, retrieval-augmented generation (RAG), recommendation
>>>>>>>> systems, and geospatial nearest-neighbor queries. Today, Spark SQL
>>>>>>>> users have to express this pattern through verbose CROSS JOIN + window
>>>>>>>> function or max_by/min_by workarounds - patterns that materialize the
>>>>>>>> full Cartesian product and give the optimizer no semantic signal for
>>>>>>>> specialized execution strategies.
>>>>>>>> Competing systems (BigQuery, SQL Server 2025, Snowflake, PostgreSQL
>>>>>>>> with pgvector) all provide dedicated primitives for this. Spark
>>>>>>>> currently does not.
>>>>>>>>
>>>>>>>> *Proposal*
>>>>>>>> This SPIP proposes extending standard SQL JOIN syntax with a
>>>>>>>> NEAREST ... BY clause for top-K ranking joins. The BY expression is
>>>>>>>> pluggable - vector similarity, geometric distance, BM25, or any
>>>>>>>> composite scoring expression - making the same syntax usable across
>>>>>>>> vector search, geospatial, and text retrieval use cases. The APPROX /
>>>>>>>> EXACT keywords make the search algorithm contract explicit, ensuring
>>>>>>>> future index creation or deletion never silently changes query results.
>>>>>>>>
>>>>>>>> The initial scope covers SQL syntax, brute-force exact execution
>>>>>>>> (rewritten into existing physical operators: JOIN + max_by/min_by with
>>>>>>>> K overload), and Spark Connect / PySpark API support. Vector index DDL
>>>>>>>> and indexed ANN execution are deferred as future work.
>>>>>>>>
>>>>>>>> *Example SQL*:
>>>>>>>>
>>>>>>>> -- Batch vector search: find the 10 most similar products for each user
>>>>>>>> SELECT q.user_id, t.*
>>>>>>>> FROM users q
>>>>>>>> INNER JOIN products t
>>>>>>>>   APPROX NEAREST 10 BY SIMILARITY
>>>>>>>>     vector_cosine_similarity(q.embedding, t.embedding)
>>>>>>>>
>>>>>>>> *Relevant Links*
>>>>>>>>
>>>>>>>> SPIP Document:
>>>>>>>> https://docs.google.com/document/d/1opFVcQJgEWDWUVB7uVlFMlNomRwxqRu8iW0JmvCvxF0/edit?tab=t.0#heading=h.hf633coi8nc7
>>>>>>>> Discussion Thread:
>>>>>>>> https://lists.apache.org/thread/zg8nk236g9f4lg6d2tm6s3xh0cfhg4hm
>>>>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-56395
>>>>>>>>
>>>>>>>> The vote will be open for at least 72 hours.
>>>>>>>> Please vote:
>>>>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>>>> [ ] +0
>>>>>>>> [ ] -1: I don't think this is a good idea because ...
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Zhidong (Zero) Qu
>>>>>>>> Software Engineer
>>>>>>>> AI System
>>>>>>>>
>>>>>>>>
>>>
