Hi Adrien

The LIMIT-small dataset contains questions such as

"Who likes Slide Rules?"

https://huggingface.co/datasets/orionweller/LIMIT-small/blob/main/queries.jsonl (query_1)

and the corpus contains entries that contain exactly the words of the question, e.g. "Slide Rules"

https://huggingface.co/datasets/orionweller/LIMIT-small/blob/main/corpus.jsonl (see first and third entry)

Sparse embeddings and BM25 work very well for these cases but, as Orion Weller and his colleagues showed in their paper, dense embeddings do not work well on such a dataset.

But then again, typos are quite common, and I was curious whether sparse embeddings can deal with them, for example

"Who likes Sleid Ruls?"

which is what I meant by a "not exact" query, because it does not spell "Slide Rules" correctly.

I am a native German speaker, and "Sleid Ruls" pronounced in German sounds very similar to "Slide Rules" pronounced in English. It also looks similar from a character-statistics point of view.

I have the following prototype implementation:

Indexing: https://github.com/wyona/katie-backend/blob/e86e6c5f0ab43cf2bc5d50ce461e9656a2c981a7/src/main/java/com/wyona/katie/handlers/LuceneVectorSearchQuestionAnswerImpl.java#L248

Searching: https://github.com/wyona/katie-backend/blob/e86e6c5f0ab43cf2bc5d50ce461e9656a2c981a7/src/main/java/com/wyona/katie/handlers/LuceneVectorSearchQuestionAnswerImpl.java#L522

Please let me know if I am doing or understanding something wrong; any feedback is very much appreciated :-)

Thanks

Michael


On 28.01.26 at 14:02, Adrien Grand wrote:
Hi Michael,

What do you mean by "not exact" queries, and how do you map them onto Lucene?

On Tue, Jan 27, 2026 at 1:38 PM Michael Wechner <[email protected]> wrote:

    I have implemented a first prototype, and for the dataset

    orionweller/LIMIT-small

    using the sparse embedding model

    naver/splade-cocondenser-ensembledistil

    I get recall@2=0.9035, which is quite good for "exact" queries,
    e.g. "Who likes Slide Rules?"

    But for "not exact" queries, for example "Who likes Sleid
    Ruls?", I do not get good results compared with dense
    embeddings (model: all-mpnet-base-v2).

    I will run some more tests, also with different models, but
    please let me know about your experiences with sparse embeddings.

    Thanks

    Michael


    On 26.01.26 at 16:31, Michael Wechner wrote:

    Hi Ben

    Cool, thanks very much for these pointers, I will try them asap :-)

    I have recently implemented MTEB using Lucene and tested it on
    the LIMIT dataset

    https://github.com/wyona/katie-backend/blob/284ef59ab70e19d95502f61b67bedc3cf7201a31/src/main/java/com/wyona/katie/services/BenchmarkService.java#L93

    and was able to reproduce some of the results of "On the
    theoretical limitations of embedding-based retrieval"

    https://arxiv.org/pdf/2508.21038

    and I would be curious to see how well sparse embeddings work.

    All the best

    Michael



    On 26.01.26 at 16:10, Benjamin Trent wrote:
    Hey Michael,

    Yeah, the Apache Lucene field type used by Elasticsearch
    is FeatureField:

    https://lucene.apache.org/core/10_3_2/core/org/apache/lucene/document/FeatureField.html


    To query, it's a boolean query of the non-zero components with
    the `linearQuery` option:

    https://lucene.apache.org/core/10_3_2/core/org/apache/lucene/document/FeatureField.html#newLinearQuery(java.lang.String,java.lang.String,float)
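    To make the mechanics concrete: each non-zero (token ID, weight) pair of a sparse embedding would become one FeatureField at index time and one newLinearQuery SHOULD clause at search time, so a document's score is the dot product of the overlapping query and document weights. Below is a stdlib-only Java sketch of just that scoring step (token IDs and weights are invented for illustration; the actual Lucene calls are only named in the comments):

```java
import java.util.HashMap;
import java.util.Map;

public class SparseDotProduct {

    // Score of one document for one query: the sum over shared token IDs
    // of queryWeight * docWeight. This mirrors what a BooleanQuery of
    // FeatureField.newLinearQuery(field, tokenId, queryWeight) SHOULD
    // clauses computes, with docWeight coming from the indexed
    // FeatureField(field, tokenId, docWeight).
    static double score(Map<Integer, Double> queryWeights, Map<Integer, Double> docWeights) {
        double score = 0.0;
        for (Map.Entry<Integer, Double> e : queryWeights.entrySet()) {
            Double docWeight = docWeights.get(e.getKey());
            if (docWeight != null) {
                score += e.getValue() * docWeight;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        Map<Integer, Double> query = new HashMap<>();
        query.put(2032, 0.5);  // hypothetical SPLADE weights for the query
        query.put(3309, 2.0);

        Map<Integer, Double> doc = new HashMap<>();
        doc.put(2032, 0.2711);
        doc.put(3309, 2.8099);
        doc.put(9065, 0.4963); // not in the query, so it contributes nothing

        System.out.println(score(query, doc)); // 0.5 * 0.2711 + 2.0 * 2.8099
    }
}
```

    One consequence of this formulation is that a query token absent from the document's expansion contributes nothing to the score, so overlap of the expanded token sets is what drives ranking.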

    Hope this helps!

    Ben

    On Mon, Jan 26, 2026 at 9:47 AM Michael Wechner
    <[email protected]> wrote:

        Hi

        I recently started to explore sparse embeddings using the
        sbert / sentence_transformers library

        https://sbert.net/docs/sparse_encoder/usage/usage.html

        where, for example, the following sentence "He drove to the
        stadium" gets embedded as follows:

        tensor(indices=tensor([[    0,     0,     0,     0,     0,     0,     0,     0,
                                    0,     0,     0,     0,     0,     0,     0,     0,
                                    0,     0,     0,     0,     0,     0,     0,     0,
                                    0,     0,     0,     0,     0,     0,     0,     0,
                                    0,     0,     0,     0,     0,     0,     0,     0,
                                    0,     0,     0,     0,     0,     0,     0,     0,
                                    0,     0,     0,     0,     0,     0,     0,     0,
                                    0,     0,     0],
                                [ 1996,  2000,  2001,  2002,  2010,  2018,  2032,  2056,
                                  2180,  2209,  2253,  2277,  2288,  2299,  2343,  2346,
                                  2359,  2365,  2374,  2380,  2441,  2482,  2563,  2688,
                                  2724,  2778,  2782,  2958,  3116,  3230,  3298,  3309,
                                  3346,  3478,  3598,  3942,  4019,  4062,  4164,  4306,
                                  4316,  4322,  4439,  4536,  4716,  5006,  5225,  5439,
                                  5533,  5581,  5823,  6891,  7281,  7467,  7921,  8514,
                                  9065, 11037, 21028]]),
                values=tensor([0.2426, 1.2840, 0.4095, 1.3777, 0.6331, 0.7404, 0.2711, 0.3561,
                               0.0691, 0.0325, 0.1355, 0.3256, 0.0203, 0.7970, 0.0535, 0.1135,
                               0.0227, 0.0375, 0.8167, 0.5986, 0.3390, 0.2573, 0.1621, 0.2597,
                               0.2726, 0.0191, 0.0752, 0.0597, 0.2644, 0.7811, 1.4855, 0.0663,
                               2.8099, 0.4074, 0.0778, 1.0642, 0.1952, 0.7472, 0.7306, 0.1108,
                               0.5747, 1.5341, 1.9030, 0.2264, 0.0995, 0.3023, 1.1830, 0.1279,
                               0.7824, 0.4283, 0.0288, 0.3535, 0.1833, 0.0554, 0.2662, 0.0574,
                               0.4963, 0.2751, 0.0340]),
                device='mps:0', size=(1, 30522), nnz=59, layout=torch.sparse_coo)

        The zeros just mean that all tokens belong to the first
        sentence, "He drove to the stadium", denoted by index 0.

        Then the 59 relevant token IDs (out of a vocabulary of size
        30522) are listed, followed by the importance weights for
        these tokens.
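        So, concretely, the printout boils down to two parallel arrays, token IDs and weights (plus the all-zero batch row). Pairing them up into a token-ID-to-weight map, the form one would iterate over when indexing, can be sketched like this (stdlib-only Java; the four sample entries are copied from the printout above):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CooToMap {

    // Pairs up the parallel tokenIds/weights arrays of a COO sparse
    // embedding into a single map from token ID to weight.
    static Map<Integer, Float> toMap(int[] tokenIds, float[] weights) {
        Map<Integer, Float> sparse = new LinkedHashMap<>();
        for (int i = 0; i < tokenIds.length; i++) {
            sparse.put(tokenIds[i], weights[i]);
        }
        return sparse;
    }

    public static void main(String[] args) {
        // First four (tokenId, weight) pairs from the tensor above
        int[] tokenIds = {1996, 2000, 2001, 2002};
        float[] weights = {0.2426f, 1.2840f, 0.4095f, 1.3777f};

        Map<Integer, Float> sparse = toMap(tokenIds, weights);
        System.out.println(sparse.get(2000)); // prints 1.284
    }
}
```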

        IIUC, OpenSearch and Elasticsearch both support sparse
        embeddings

        https://sbert.net/examples/sparse_encoder/applications/semantic_search/README.html#opensearch-integration
        https://sbert.net/examples/sparse_encoder/applications/semantic_search/README.html#elasticsearch-integration

        but are sparse embeddings also supported by Lucene itself?

        Thanks

        Michael









--
Adrien
