Hi Michael,

What do you mean by "not exact" queries, and how do you map them onto Lucene?

On Tue, Jan 27, 2026 at 1:38 PM Michael Wechner <[email protected]>
wrote:

> I have implemented a first prototype and for the dataset
>
> orionweller/LIMIT-small
>
> using the sparse embedding model
>
> naver/splade-cocondenser-ensembledistil
>
> I get recall@2=0.9035, which is quite good for "exact" queries, e.g. "Who
> likes Slide Rules?"
>
> But for "not exact" queries like for example "Who likes Sleid Ruls?" I do
> not get good results when comparing with dense embeddings (Model:
> all-mpnet-base-v2)
>
> I will test some more, also using different models, but please let me know
> about your experiences using sparse embeddings.
>
> Thanks
>
> Michael
>
>
> On 26.01.26 at 16:31, Michael Wechner wrote:
>
> Hi Ben
>
> Cool, thanks very much for these pointers, will try it asap :-)
>
> I have recently implemented MTEB using Lucene and tested it on the LIMIT
> dataset
>
>
> https://github.com/wyona/katie-backend/blob/284ef59ab70e19d95502f61b67bedc3cf7201a31/src/main/java/com/wyona/katie/services/BenchmarkService.java#L93
>
> and was able to reproduce some of the results of "On the theoretical
> limitations of embedding-based retrieval"
>
> https://arxiv.org/pdf/2508.21038
>
> and I would be curious to see how well sparse embeddings work.
>
> All the best
>
> Michael
>
>
>
> On 26.01.26 at 16:10, Benjamin Trent wrote:
>
> Hey Michael,
>
> Yeah, the Apache Lucene field type used by Elasticsearch is FeatureField:
> https://lucene.apache.org/core/10_3_2/core/org/apache/lucene/document/FeatureField.html
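>
> As a rough sketch (the field name "splade" and the token ids/weights below
> are made up for illustration), indexing one document with its sparse
> embedding could look like this:
>
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.FeatureField;
>
> // Hypothetical helper: turn a sparse embedding (parallel arrays of encoder
> // token ids and positive weights) into a Lucene document.
> static Document toDocument(int[] tokenIds, float[] weights) {
>   Document doc = new Document();
>   for (int i = 0; i < tokenIds.length; i++) {
>     // One FeatureField per non-zero component; the token id becomes the feature name.
>     doc.add(new FeatureField("splade", Integer.toString(tokenIds[i]), weights[i]));
>   }
>   return doc; // then add it to the index via IndexWriter as usual
> }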
>
>
> To query, you build a boolean query over the non-zero components using the
> `newLinearQuery` option:
> https://lucene.apache.org/core/10_3_2/core/org/apache/lucene/document/FeatureField.html#newLinearQuery(java.lang.String,java.lang.String,float)
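>
> A minimal sketch of the query side (again assuming the hypothetical
> "splade" field from above and made-up query components):
>
> import org.apache.lucene.document.FeatureField;
> import org.apache.lucene.search.BooleanClause;
> import org.apache.lucene.search.BooleanQuery;
> import org.apache.lucene.search.Query;
>
> // Hypothetical helper: build a disjunction of linear feature queries so that
> // the score approximates the dot product of query and document embeddings.
> static Query toQuery(int[] queryTokenIds, float[] queryWeights) {
>   BooleanQuery.Builder builder = new BooleanQuery.Builder();
>   for (int i = 0; i < queryTokenIds.length; i++) {
>     // Each SHOULD clause contributes queryWeight * indexed feature value to the score.
>     builder.add(
>         FeatureField.newLinearQuery("splade", Integer.toString(queryTokenIds[i]), queryWeights[i]),
>         BooleanClause.Occur.SHOULD);
>   }
>   return builder.build();
> }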
>
> Hope this helps!
>
> Ben
>
> On Mon, Jan 26, 2026 at 9:47 AM Michael Wechner <[email protected]>
> wrote:
>
>> Hi
>>
>> I recently started to explore sparse embeddings using the sbert /
>> sentence_transformers library
>>
>> https://sbert.net/docs/sparse_encoder/usage/usage.html
>>
>> where, for example, the sentence "He drove to the stadium" gets embedded
>> as follows:
>>
>> tensor(indices=tensor([[    0,     0,     0,     0,     0,     0,     0,     0,
>>                              0,     0,     0,     0,     0,     0,     0,     0,
>>                              0,     0,     0,     0,     0,     0,     0,     0,
>>                              0,     0,     0,     0,     0,     0,     0,     0,
>>                              0,     0,     0,     0,     0,     0,     0,     0,
>>                              0,     0,     0,     0,     0,     0,     0,     0,
>>                              0,     0,     0,     0,     0,     0,     0,     0,
>>                              0,     0,     0],
>>                         [ 1996,  2000,  2001,  2002,  2010,  2018,  2032,  2056,
>>                           2180,  2209,  2253,  2277,  2288,  2299,  2343,  2346,
>>                           2359,  2365,  2374,  2380,  2441,  2482,  2563,  2688,
>>                           2724,  2778,  2782,  2958,  3116,  3230,  3298,  3309,
>>                           3346,  3478,  3598,  3942,  4019,  4062,  4164,  4306,
>>                           4316,  4322,  4439,  4536,  4716,  5006,  5225,  5439,
>>                           5533,  5581,  5823,  6891,  7281,  7467,  7921,  8514,
>>                           9065, 11037, 21028]]),
>>        values=tensor([0.2426, 1.2840, 0.4095, 1.3777, 0.6331, 0.7404, 0.2711, 0.3561,
>>                       0.0691, 0.0325, 0.1355, 0.3256, 0.0203, 0.7970, 0.0535, 0.1135,
>>                       0.0227, 0.0375, 0.8167, 0.5986, 0.3390, 0.2573, 0.1621, 0.2597,
>>                       0.2726, 0.0191, 0.0752, 0.0597, 0.2644, 0.7811, 1.4855, 0.0663,
>>                       2.8099, 0.4074, 0.0778, 1.0642, 0.1952, 0.7472, 0.7306, 0.1108,
>>                       0.5747, 1.5341, 1.9030, 0.2264, 0.0995, 0.3023, 1.1830, 0.1279,
>>                       0.7824, 0.4283, 0.0288, 0.3535, 0.1833, 0.0554, 0.2662, 0.0574,
>>                       0.4963, 0.2751, 0.0340]),
>>        device='mps:0', size=(1, 30522), nnz=59, layout=torch.sparse_coo)
>>
>> The zeros just mean that all tokens belong to the first (and only)
>> sentence, "He drove to the stadium", which is denoted by index 0.
>>
>> Then the 59 relevant token IDs (out of a vocabulary of size 30522) are
>> listed, followed by the importance weights for those tokens.
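>>
>> In Java terms (just an illustrative sketch, the record and field names are
>> made up), such a sparse embedding boils down to two parallel arrays once
>> the row of sentence indices is dropped:
>>
>> // Hypothetical carrier for one sparse embedding on the Java/Lucene side,
>> // e.g. deserialized from JSON produced by the Python encoder.
>> public record SparseEmbedding(int[] tokenIds, float[] weights) {
>>     public SparseEmbedding {
>>         if (tokenIds.length != weights.length) {
>>             throw new IllegalArgumentException("token ids and weights must be parallel arrays");
>>         }
>>     }
>> }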
>>
>> IIUC, OpenSearch and Elasticsearch both support sparse embeddings
>>
>>
>> https://sbert.net/examples/sparse_encoder/applications/semantic_search/README.html#opensearch-integration
>>
>> https://sbert.net/examples/sparse_encoder/applications/semantic_search/README.html#elasticsearch-integration
>>
>> but are sparse embeddings also supported by Lucene itself?
>>
>> Thanks
>>
>> Michael
>>
>>
>>
>>
>>
>>
>>
>>

-- 
Adrien
