Ah, thanks for clarifying. I had not seen the slight modification in your
query and thought that you were referring to some unsafe retrieval method
when talking about "not exact" queries.

Whether sparse embeddings can deal with typos is a good question.
Intuitively, the answer is yes, but I don't know if anyone has properly
researched this topic already.
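One way to build intuition: whether a typo survives depends on whether the tokenizer still maps the misspelling onto some of the same vocabulary entries as the correct spelling. A toy stdlib sketch (not SPLADE's actual WordPiece tokenizer; whole words vs. character trigrams as a crude proxy):

```python
def char_ngrams(text, n=3):
    """Set of character n-grams, a rough proxy for subword overlap."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

correct = "Slide Rules"
typo = "Sleid Ruls"

# Whole-word terms share nothing, so purely lexical matching misses:
word_overlap = set(correct.lower().split()) & set(typo.lower().split())

# Character trigrams still overlap, which is why a subword tokenizer
# *may* map the typo onto some of the same vocabulary entries:
tri_overlap = char_ngrams(correct) & char_ngrams(typo)

print(word_overlap)         # set()
print(sorted(tri_overlap))  # [' ru', 'rul']
```

Whether SPLADE's learned expansion recovers more than this crude overlap suggests is exactly the open question above.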

On Wed, Jan 28, 2026 at 2:27 PM Michael Wechner <[email protected]>
wrote:

> Hi Adrien
>
> The LIMIT-small dataset contains questions such as
>
> "Who likes Slide Rules?"
>
>
> https://huggingface.co/datasets/orionweller/LIMIT-small/blob/main/queries.jsonl
> (query_1)
>
> and the corpus contains entries that use exactly the words of the
> question, e.g. "Slide Rules"
>
>
> https://huggingface.co/datasets/orionweller/LIMIT-small/blob/main/corpus.jsonl
> (see first and third entry)
>
> Sparse embeddings and BM25 work very well for these cases, but as Orion
> Weller and his colleagues showed in their paper, dense embeddings do not
> work well on such a dataset.
>
> But then again, typos are quite common, and I was curious whether sparse
> embeddings can deal with them, for example
>
> "Who likes Sleid Ruls?"
>
> This is what I meant by a "not exact" query: it does not spell "Slide
> Rules" correctly.
>
> I am a native German speaker, and "Sleid Ruls" pronounced in German is
> very similar to "Slide Rules" pronounced in English. It also looks similar
> from a character-statistics point of view.
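The character-level similarity can be made concrete with a quick stdlib check (just an illustration of the "looks similar" observation, not a retrieval method):

```python
from difflib import SequenceMatcher

# Rough character-level similarity between the misspelled query and the
# target; both lowercased so case does not dominate the comparison.
ratio = SequenceMatcher(None, "sleid ruls", "slide rules").ratio()
print(round(ratio, 3))
```

The ratio comes out well above 0.8, so the two strings are indeed close at the character level even though they share no whole words.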
>
> I have the current prototype implementation:
>
> Indexing:
> https://github.com/wyona/katie-backend/blob/e86e6c5f0ab43cf2bc5d50ce461e9656a2c981a7/src/main/java/com/wyona/katie/handlers/LuceneVectorSearchQuestionAnswerImpl.java#L248
>
> Searching:
> https://github.com/wyona/katie-backend/blob/e86e6c5f0ab43cf2bc5d50ce461e9656a2c981a7/src/main/java/com/wyona/katie/handlers/LuceneVectorSearchQuestionAnswerImpl.java#L522
>
> Please let me know if I am doing or understanding something wrong; any
> feedback is very much appreciated :-)
>
> Thanks
>
> Michael
>
>
> On 28.01.26 at 14:02, Adrien Grand wrote:
>
> Hi Michael,
>
> What do you mean by "not exact" queries, and how do you map them onto Lucene?
>
> On Tue, Jan 27, 2026 at 1:38 PM Michael Wechner <[email protected]>
> wrote:
>
>> I have implemented a first prototype and for the dataset
>>
>> orionweller/LIMIT-small
>>
>> using the sparse embedding model
>>
>> naver/splade-cocondenser-ensembledistil
>>
>> I get recall@2=0.9035, which is quite good for "exact" queries, e.g. "Who
>> likes Slide Rules?"
>>
>> But for "not exact" queries, for example "Who likes Sleid Ruls?", I do
>> not get good results compared with dense embeddings (model:
>> all-mpnet-base-v2).
>>
>> I will test some more, also using different models, but please let me
>> know about your experiences using sparse embeddings.
>>
>> Thanks
>>
>> Michael
>>
>>
>> On 26.01.26 at 16:31, Michael Wechner wrote:
>>
>> Hi Ben
>>
>> Cool, thanks very much for these pointers, will try it asap :-)
>>
>> I have recently implemented MTEB using Lucene and tested it on the LIMIT
>> dataset
>>
>>
>> https://github.com/wyona/katie-backend/blob/284ef59ab70e19d95502f61b67bedc3cf7201a31/src/main/java/com/wyona/katie/services/BenchmarkService.java#L93
>>
>> and was able to reproduce some of the results of "On the theoretical
>> limitations of embedding-based retrieval"
>>
>> https://arxiv.org/pdf/2508.21038
>>
>> and I would be curious to see how well sparse embeddings work.
>>
>> All the best
>>
>> Michael
>>
>>
>>
>> On 26.01.26 at 16:10, Benjamin Trent wrote:
>>
>> Hey Michael,
>>
>> Yeah, the Apache Lucene field type used by Elasticsearch is
>> FeatureField:
>> https://lucene.apache.org/core/10_3_2/core/org/apache/lucene/document/FeatureField.html
>>
>>
>> To query, it's a boolean query of the non-zero components with the
>> `linearQuery` option:
>> https://lucene.apache.org/core/10_3_2/core/org/apache/lucene/document/FeatureField.html#newLinearQuery(java.lang.String,java.lang.String,float)
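For intuition, a BooleanQuery of SHOULD-clause linear FeatureField queries scores a document as the sum, over tokens present in both vectors, of query weight times stored feature value, i.e. a sparse dot product. A minimal Python sketch of that scoring arithmetic (function name and token-id/weight maps are made up for illustration, not Lucene API):

```python
def sparse_dot(query, doc):
    """Sum of query_weight * doc_weight over tokens present in both
    sparse vectors -- what a BooleanQuery of SHOULD linear FeatureField
    clauses effectively computes for one document."""
    # Iterate over the smaller map for efficiency.
    if len(doc) < len(query):
        query, doc = doc, query
    return sum(w * doc[t] for t, w in query.items() if t in doc)

# Hypothetical token-id -> weight maps for a query and a document:
query_vec = {2000: 1.28, 2002: 1.38, 3298: 1.49}
doc_vec = {2000: 0.9, 3298: 2.0, 4316: 0.11}

print(sparse_dot(query_vec, doc_vec))  # 1.28*0.9 + 1.49*2.0
```

Only the two shared token ids (2000 and 3298) contribute; the unmatched components are ignored, which is what makes the inverted-index evaluation cheap.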
>>
>> Hope this helps!
>>
>> Ben
>>
>> On Mon, Jan 26, 2026 at 9:47 AM Michael Wechner <
>> [email protected]> wrote:
>>
>>> Hi
>>>
>>> I recently started to explore sparse embeddings using the sbert /
>>> sentence_transformers library
>>>
>>> https://sbert.net/docs/sparse_encoder/usage/usage.html
>>>
>>> where, for example, the sentence "He drove to the stadium" gets
>>> embedded as follows:
>>>
>>> tensor(indices=tensor([[   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
>>>                            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
>>>                            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
>>>                            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
>>>                            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
>>>                            0,    0,    0,    0,    0,    0,    0,    0,    0],
>>>                        [1996, 2000, 2001, 2002, 2010, 2018, 2032, 2056, 2180, 2209,
>>>                         2253, 2277, 2288, 2299, 2343, 2346, 2359, 2365, 2374, 2380,
>>>                         2441, 2482, 2563, 2688, 2724, 2778, 2782, 2958, 3116, 3230,
>>>                         3298, 3309, 3346, 3478, 3598, 3942, 4019, 4062, 4164, 4306,
>>>                         4316, 4322, 4439, 4536, 4716, 5006, 5225, 5439, 5533, 5581,
>>>                         5823, 6891, 7281, 7467, 7921, 8514, 9065, 11037, 21028]]),
>>>        values=tensor([0.2426, 1.2840, 0.4095, 1.3777, 0.6331, 0.7404, 0.2711, 0.3561,
>>>                       0.0691, 0.0325, 0.1355, 0.3256, 0.0203, 0.7970, 0.0535, 0.1135,
>>>                       0.0227, 0.0375, 0.8167, 0.5986, 0.3390, 0.2573, 0.1621, 0.2597,
>>>                       0.2726, 0.0191, 0.0752, 0.0597, 0.2644, 0.7811, 1.4855, 0.0663,
>>>                       2.8099, 0.4074, 0.0778, 1.0642, 0.1952, 0.7472, 0.7306, 0.1108,
>>>                       0.5747, 1.5341, 1.9030, 0.2264, 0.0995, 0.3023, 1.1830, 0.1279,
>>>                       0.7824, 0.4283, 0.0288, 0.3535, 0.1833, 0.0554, 0.2662, 0.0574,
>>>                       0.4963, 0.2751, 0.0340]),
>>>        device='mps:0', size=(1, 30522), nnz=59, layout=torch.sparse_coo)
>>>
>>> The zeros just mean that all tokens belong to the first sentence "He
>>> drove to the stadium", denoted by index 0.
>>>
>>> Then the 59 relevant token IDs (from a vocabulary of size 30522) are
>>> listed, followed by the importance weights for those tokens.
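For indexing, output like the above is typically flattened into a token-id to weight map. A tiny sketch using the first three of the 59 (token id, weight) pairs from the printout; the variable names are illustrative, not from any library:

```python
# First three (token id, weight) pairs from the sparse COO printout,
# flattened into the map shape one would index, e.g. one feature entry
# per token id.
token_ids = [1996, 2000, 2001]
weights = [0.2426, 1.2840, 0.4095]

sparse_vec = dict(zip(token_ids, weights))
print(sparse_vec[2000])  # 1.284
```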
>>>
>>> IIUC, OpenSearch and Elasticsearch both support sparse embeddings
>>>
>>>
>>> https://sbert.net/examples/sparse_encoder/applications/semantic_search/README.html#opensearch-integration
>>>
>>> https://sbert.net/examples/sparse_encoder/applications/semantic_search/README.html#elasticsearch-integration
>>>
>>> but are sparse embeddings also supported by Lucene itself?
>>>
>>> Thanks
>>>
>>> Michael
>>>
>>>
>>>
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
>>>
>
> --
> Adrien
>
>

-- 
Adrien
