I have implemented a first prototype and for the dataset

orionweller/LIMIT-small

using the sparse embedding model

naver/splade-cocondenser-ensembledistil

I get recall@2=0.9035, which is quite good for "exact" queries, e.g. "Who likes Slide Rules?"

But for "not exact" queries, for example "Who likes Sleid Ruls?", I do not get good results when comparing with dense embeddings (model: all-mpnet-base-v2).
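
In case it helps to compare numbers: recall@2 here means the fraction of each query's relevant documents that show up among the top 2 hits, averaged over all queries (IIRC each LIMIT query has exactly two relevant documents, hence the cutoff k=2). A minimal Java sketch of that computation; the Searcher hook is a hypothetical placeholder, not the actual BenchmarkService code:

import java.util.List;
import java.util.Map;
import java.util.Set;

public class RecallAtK {

    // Hypothetical search hook: returns the top-k document IDs for a query.
    interface Searcher {
        List<String> searchTop(String query, int k);
    }

    // recall@k: per query, the fraction of its relevant documents found in
    // the top-k results; the final score is the average over all queries.
    static double recallAtK(Searcher searcher, Map<String, Set<String>> qrels, int k) {
        double sum = 0.0;
        for (Map.Entry<String, Set<String>> entry : qrels.entrySet()) {
            Set<String> relevant = entry.getValue();
            List<String> topK = searcher.searchTop(entry.getKey(), k);
            long found = topK.stream().filter(relevant::contains).count();
            sum += (double) found / relevant.size();
        }
        return sum / qrels.size();
    }
}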

I will test some more, also using different models, but please let me know about your experiences using sparse embeddings.

Thanks

Michael


On 26.01.26 at 16:31, Michael Wechner wrote:

Hi Ben

Cool, thanks very much for these pointers, will try it asap :-)

I have recently implemented MTEB evaluation using Lucene and tested it on the LIMIT dataset

https://github.com/wyona/katie-backend/blob/284ef59ab70e19d95502f61b67bedc3cf7201a31/src/main/java/com/wyona/katie/services/BenchmarkService.java#L93

and was able to reproduce some of the results of "On the theoretical limitations of embedding-based retrieval"

https://arxiv.org/pdf/2508.21038

and I would be curious to see how well sparse embeddings work.

All the best

Michael



On 26.01.26 at 16:10, Benjamin Trent wrote:
Hey Michael,

Yeah, the Apache Lucene field type used by Elasticsearch is FeatureField: https://lucene.apache.org/core/10_3_2/core/org/apache/lucene/document/FeatureField.html

To query, it's a boolean query over the non-zero components using `FeatureField.newLinearQuery`: https://lucene.apache.org/core/10_3_2/core/org/apache/lucene/document/FeatureField.html#newLinearQuery(java.lang.String,java.lang.String,float)
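
Roughly like this (untested sketch; the field and feature names are just examples, and the weights would come from your sparse encoder):

import java.io.IOException;
import java.util.Map;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.FeatureField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

public class SparseVectorExample {

    // Indexing: one FeatureField per non-zero component of the sparse
    // vector, all sharing the same field name; the feature name could be
    // the token (or token ID) and the feature value its weight.
    static void index(IndexWriter writer, Map<String, Float> sparseVector)
            throws IOException {
        Document doc = new Document();
        for (Map.Entry<String, Float> c : sparseVector.entrySet()) {
            doc.add(new FeatureField("sparse_embedding", c.getKey(), c.getValue()));
        }
        writer.addDocument(doc);
    }

    // Querying: boolean OR over the non-zero query components. newLinearQuery
    // scores a match as queryWeight * indexedFeatureValue, so the sum over the
    // matching SHOULD clauses approximates the dot product of the two vectors.
    static Query query(Map<String, Float> sparseQueryVector) {
        BooleanQuery.Builder b = new BooleanQuery.Builder();
        for (Map.Entry<String, Float> c : sparseQueryVector.entrySet()) {
            b.add(FeatureField.newLinearQuery("sparse_embedding",
                    c.getKey(), c.getValue()),
                    BooleanClause.Occur.SHOULD);
        }
        return b.build();
    }
}

One caveat, IIRC: FeatureField stores the feature value with reduced precision (it is encoded into the term frequency), so the resulting scores are an approximation of the exact dot product.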

Hope this helps!

Ben

On Mon, Jan 26, 2026 at 9:47 AM Michael Wechner <[email protected]> wrote:

    Hi

    I recently started to explore sparse embeddings using the sbert /
    sentence_transformers library

    https://sbert.net/docs/sparse_encoder/usage/usage.html

    where, for example, the sentence "He drove to the stadium" is
    embedded as follows:

    tensor(indices=tensor([[    0,     0,     0,     0,     0,     0,     0,     0,
                                0,     0,     0,     0,     0,     0,     0,     0,
                                0,     0,     0,     0,     0,     0,     0,     0,
                                0,     0,     0,     0,     0,     0,     0,     0,
                                0,     0,     0,     0,     0,     0,     0,     0,
                                0,     0,     0,     0,     0,     0,     0,     0,
                                0,     0,     0,     0,     0,     0,     0,     0,
                                0,     0,     0],
                           [ 1996,  2000,  2001,  2002,  2010,  2018,  2032,  2056,
                             2180,  2209,  2253,  2277,  2288,  2299,  2343,  2346,
                             2359,  2365,  2374,  2380,  2441,  2482,  2563,  2688,
                             2724,  2778,  2782,  2958,  3116,  3230,  3298,  3309,
                             3346,  3478,  3598,  3942,  4019,  4062,  4164,  4306,
                             4316,  4322,  4439,  4536,  4716,  5006,  5225,  5439,
                             5533,  5581,  5823,  6891,  7281,  7467,  7921,  8514,
                             9065, 11037, 21028]]),
           values=tensor([0.2426, 1.2840, 0.4095, 1.3777, 0.6331, 0.7404, 0.2711,
                          0.3561, 0.0691, 0.0325, 0.1355, 0.3256, 0.0203, 0.7970,
                          0.0535, 0.1135, 0.0227, 0.0375, 0.8167, 0.5986, 0.3390,
                          0.2573, 0.1621, 0.2597, 0.2726, 0.0191, 0.0752, 0.0597,
                          0.2644, 0.7811, 1.4855, 0.0663, 2.8099, 0.4074, 0.0778,
                          1.0642, 0.1952, 0.7472, 0.7306, 0.1108, 0.5747, 1.5341,
                          1.9030, 0.2264, 0.0995, 0.3023, 1.1830, 0.1279, 0.7824,
                          0.4283, 0.0288, 0.3535, 0.1833, 0.0554, 0.2662, 0.0574,
                          0.4963, 0.2751, 0.0340]),
           device='mps:0', size=(1, 30522), nnz=59,
           layout=torch.sparse_coo)

    The zeros in the first index row just mean that all tokens belong
    to the first (and only) sentence in the batch, "He drove to the
    stadium", which is denoted by 0.

    The second row then lists the 59 relevant token IDs (out of the
    vocabulary of size 30522), and the values tensor holds the
    importance weight of each of these tokens.
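
    To see which tokens these are, the IDs can be mapped back to
    WordPiece tokens. A minimal Java sketch, assuming the vocab.txt of
    bert-base-uncased (30522 lines, where the zero-based line number is
    the token ID) has been downloaded locally; the (ID, weight) pairs
    are taken from the output above:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;

    public class VocabLookup {

        public static void main(String[] args) throws IOException {
            // vocab.txt: one token per line, zero-based line number = token ID
            List<String> vocab = Files.readAllLines(Path.of("vocab.txt"));

            // A few (token ID, weight) pairs from the sparse tensor above
            int[] tokenIds = {2000, 2002, 3298, 3346};
            float[] weights = {1.2840f, 1.3777f, 1.4855f, 2.8099f};

            for (int i = 0; i < tokenIds.length; i++) {
                System.out.printf("%-12s %.4f%n", vocab.get(tokenIds[i]), weights[i]);
            }
        }
    }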

    IIUC OpenSearch and Elasticsearch both support sparse embeddings

    https://sbert.net/examples/sparse_encoder/applications/semantic_search/README.html#opensearch-integration
    https://sbert.net/examples/sparse_encoder/applications/semantic_search/README.html#elasticsearch-integration

    but are sparse embeddings also supported by Lucene itself?

    Thanks

    Michael





