On 28.01.26 at 16:22, Michael Wechner wrote:

Whether sparse embeddings can deal with typos is a good question. Intuitively, the answer is yes, but I don't know if anyone has properly researched this topic already.

My very simple tests with the following sparse embedding model

naver/splade-cocondenser-ensembledistil
https://huggingface.co/naver/splade-cocondenser-ensembledistil

suggest that it unfortunately does not deal well with such typos, but I am
trying to understand this better... will keep you posted :-)

In the case above, the decoding of the sparse embedding of "Slide Rules" is as follows:

Slide Rules -> Top 10 tokens:  ("slide", 3.19), ("slides", 2.45), ("rules", 2.35), ("rule", 2.15), ("sliding", 1.37), ("game", 0.75), ("movement", 0.61), ("technique", 0.45), ("kyle", 0.43), ("slope", 0.43)

and the decoding of the sparse embedding of "Sleid Ruls" is as follows:

Sleid Ruls -> Top 10 tokens:  ("##ei", 2.86), ("##ls", 1.93), ("sl", 1.77), ("ru", 1.65), ("##d", 1.49), ("##l", 1.33), ("dr", 0.50), ("ski", 0.50), ("strain", 0.46), ("germany", 0.39)

which probably explains why the recall is bad for the query "Sleid Ruls": the misspelling gets split into subword tokens which share almost nothing with the tokens of "Slide Rules".
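
For reference, a minimal sketch of such a top-k decoding in Java, assuming the
sparse embedding has already been materialized as a float[] over the model's
vocabulary (the vocab array stands in for the tokenizer's vocabulary; all
names are illustrative):

    import java.util.Comparator;
    import java.util.List;
    import java.util.stream.IntStream;

    class SparseDecoder {

        record TokenWeight(String token, float weight) {}

        // Top-k (token, weight) pairs of a sparse embedding, given the
        // embedding as a dense weight array over the vocabulary.
        static List<TokenWeight> topK(float[] weights, String[] vocab, int k) {
            return IntStream.range(0, weights.length)
                    .filter(i -> weights[i] > 0f) // keep only non-zero components
                    .boxed()
                    .sorted(Comparator.comparingDouble((Integer i) -> -weights[i]))
                    .limit(k)
                    .map(i -> new TokenWeight(vocab[i], weights[i]))
                    .toList();
        }
    }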



Thanks

Michael



On Wed, Jan 28, 2026 at 2:27 PM Michael Wechner <[email protected]> wrote:

    Hi Adrien

    The LIMIT-small dataset contains questions such as

    "Who likes Slide Rules?"

    
    https://huggingface.co/datasets/orionweller/LIMIT-small/blob/main/queries.jsonl
    (query_1)

    and the corpus contains entries which contain exactly the words
    of the question, e.g. "Slide Rules"

    
    https://huggingface.co/datasets/orionweller/LIMIT-small/blob/main/corpus.jsonl
    (see first and third entry)

    Sparse embeddings and BM25 work very well for these cases, but as
    Orion Weller and his colleagues showed in their paper, dense
    embeddings do not work well for such a dataset.

    But then again, typos are quite common, and I was curious
    whether sparse embeddings can deal with them, like for example

    "Who likes Sleid Ruls?"

    which is what I meant by a "not exact" query, because it does not
    spell "Slide Rules" correctly.

    I am a native German speaker, and "Sleid Ruls" pronounced in German
    is phonetically very similar to "Slide Rules" pronounced in English.
    It also looks similar from a character statistics point of view.

    My current prototype implementation is here:

    Indexing:
    
    https://github.com/wyona/katie-backend/blob/e86e6c5f0ab43cf2bc5d50ce461e9656a2c981a7/src/main/java/com/wyona/katie/handlers/LuceneVectorSearchQuestionAnswerImpl.java#L248

    Searching:
    
    https://github.com/wyona/katie-backend/blob/e86e6c5f0ab43cf2bc5d50ce461e9656a2c981a7/src/main/java/com/wyona/katie/handlers/LuceneVectorSearchQuestionAnswerImpl.java#L522

    Please let me know if I might be doing or understanding something
    wrong; any feedback is very much appreciated :-)

    Thanks

    Michael


    On 28.01.26 at 14:02, Adrien Grand wrote:
    Hi Michael,

    What do you mean by "not exact" queries, and how do you map them
    onto Lucene?

    On Tue, Jan 27, 2026 at 1:38 PM Michael Wechner
    <[email protected]> wrote:

        I have implemented a first prototype and for the dataset

        orionweller/LIMIT-small

        using the sparse embedding model

        naver/splade-cocondenser-ensembledistil

        I get recall@2=0.9035, which is quite good for "exact"
        queries, e.g. "Who likes Slide Rules?"

        But for "not exact" queries like for example "Who likes
        Sleid Ruls?" I do not get good results when comparing with
        dense embeddings (Model: all-mpnet-base-v2)
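
        For reference, a simplified sketch of the recall@k metric
        (names and types are illustrative, not taken from my prototype):

            import java.util.List;
            import java.util.Map;
            import java.util.Set;

            class RecallAtK {

                // recall@k: for each query, the fraction of its relevant
                // documents that appear among the top-k retrieved ids,
                // averaged over all queries.
                static double recallAtK(Map<String, Set<String>> relevantByQuery,
                                        Map<String, List<String>> retrievedByQuery,
                                        int k) {
                    double sum = 0.0;
                    for (Map.Entry<String, Set<String>> e : relevantByQuery.entrySet()) {
                        List<String> retrieved =
                                retrievedByQuery.getOrDefault(e.getKey(), List.of());
                        List<String> topK =
                                retrieved.subList(0, Math.min(k, retrieved.size()));
                        long found = topK.stream().filter(e.getValue()::contains).count();
                        sum += (double) found / e.getValue().size();
                    }
                    return sum / relevantByQuery.size();
                }
            }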

        I will test some more, also using different models, but
        please let me know about your experiences using sparse
        embeddings.

        Thanks

        Michael


        On 26.01.26 at 16:31, Michael Wechner wrote:

        Hi Ben

        Cool, thanks very much for these pointers, will try them asap :-)

        I have recently implemented MTEB using Lucene and tested it
        on the LIMIT dataset

        
        https://github.com/wyona/katie-backend/blob/284ef59ab70e19d95502f61b67bedc3cf7201a31/src/main/java/com/wyona/katie/services/BenchmarkService.java#L93

        and was able to reproduce some of the results of "On the
        theoretical limitations of embedding-based retrieval"

        https://arxiv.org/pdf/2508.21038

        and I would be curious to see how well sparse embeddings work.

        All the best

        Michael



        On 26.01.26 at 16:10, Benjamin Trent wrote:
        Hey Michael,

        Yeah, the Apache Lucene field type used by Elasticsearch is
        FeatureField:
        
        https://lucene.apache.org/core/10_3_2/core/org/apache/lucene/document/FeatureField.html


        To query, it's a boolean query of the non-zero components
        with the `linearQuery` option:
        
        https://lucene.apache.org/core/10_3_2/core/org/apache/lucene/document/FeatureField.html#newLinearQuery(java.lang.String,java.lang.String,float)
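
        For example, a minimal sketch of both sides (the field name
        "splade" and the surrounding types are made up; FeatureField
        values must be positive):

            import java.io.IOException;
            import java.util.Map;

            import org.apache.lucene.document.Document;
            import org.apache.lucene.document.FeatureField;
            import org.apache.lucene.index.IndexWriter;
            import org.apache.lucene.search.BooleanClause;
            import org.apache.lucene.search.BooleanQuery;
            import org.apache.lucene.search.Query;

            class SpladeLucene {

                // Indexing: one FeatureField per non-zero component
                // (token -> weight) of the document's sparse vector.
                static void addDocument(IndexWriter writer,
                        Map<String, Float> sparseVector) throws IOException {
                    Document doc = new Document();
                    for (Map.Entry<String, Float> e : sparseVector.entrySet()) {
                        doc.add(new FeatureField("splade", e.getKey(), e.getValue()));
                    }
                    writer.addDocument(doc);
                }

                // Searching: a disjunction of linear queries, one per non-zero
                // component of the query vector. A matching document scores the
                // sum of queryWeight * docWeight over the shared tokens, i.e.
                // the sparse dot product.
                static Query buildQuery(Map<String, Float> sparseQueryVector) {
                    BooleanQuery.Builder builder = new BooleanQuery.Builder();
                    for (Map.Entry<String, Float> e : sparseQueryVector.entrySet()) {
                        builder.add(
                                FeatureField.newLinearQuery("splade", e.getKey(), e.getValue()),
                                BooleanClause.Occur.SHOULD);
                    }
                    return builder.build();
                }
            }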

        Hope this helps!

        Ben

        On Mon, Jan 26, 2026 at 9:47 AM Michael Wechner
        <[email protected]> wrote:

            Hi

            I recently started to explore sparse embeddings using
            the sbert /
            sentence_transformers library

            https://sbert.net/docs/sparse_encoder/usage/usage.html

            where, for example, the following sentence "He drove to the
            stadium" gets embedded as follows:

            tensor(indices=tensor([[    0,     0,     0,  ...,     0,     0,     0],
                                   [ 1996,  2000,  2001,  2002,  2010,  2018,  2032,  2056,
                                     2180,  2209,  2253,  2277,  2288,  2299,  2343,  2346,
                                     2359,  2365,  2374,  2380,  2441,  2482,  2563,  2688,
                                     2724,  2778,  2782,  2958,  3116,  3230,  3298,  3309,
                                     3346,  3478,  3598,  3942,  4019,  4062,  4164,  4306,
                                     4316,  4322,  4439,  4536,  4716,  5006,  5225,  5439,
                                     5533,  5581,  5823,  6891,  7281,  7467,  7921,  8514,
                                     9065, 11037, 21028]]),
                   values=tensor([0.2426, 1.2840, 0.4095, 1.3777, 0.6331, 0.7404, 0.2711,
                                  0.3561, 0.0691, 0.0325, 0.1355, 0.3256, 0.0203, 0.7970,
                                  0.0535, 0.1135, 0.0227, 0.0375, 0.8167, 0.5986, 0.3390,
                                  0.2573, 0.1621, 0.2597, 0.2726, 0.0191, 0.0752, 0.0597,
                                  0.2644, 0.7811, 1.4855, 0.0663, 2.8099, 0.4074, 0.0778,
                                  1.0642, 0.1952, 0.7472, 0.7306, 0.1108, 0.5747, 1.5341,
                                  1.9030, 0.2264, 0.0995, 0.3023, 1.1830, 0.1279, 0.7824,
                                  0.4283, 0.0288, 0.3535, 0.1833, 0.0554, 0.2662, 0.0574,
                                  0.4963, 0.2751, 0.0340]),
                   device='mps:0', size=(1, 30522), nnz=59, layout=torch.sparse_coo)

            The zeros in the first index row just mean that all tokens
            belong to the first sentence, "He drove to the stadium",
            denoted by 0.

            The second index row lists the 59 relevant token IDs (out of
            a vocabulary of size 30522), and the values tensor contains
            the importance weights for these tokens.
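
            In other words, the sparse embedding boils down to a small
            token-to-weight map. A minimal sketch of extracting it,
            assuming the ids/values arrays above and some vocab lookup
            (names are illustrative):

                import java.util.LinkedHashMap;
                import java.util.Map;

                class SparseVector {

                    // Build a token -> weight map from the nnz entries of
                    // the sparse COO tensor (second index row = token ids,
                    // values = weights).
                    static Map<String, Float> toMap(int[] tokenIds, float[] values,
                            String[] vocab) {
                        Map<String, Float> vector = new LinkedHashMap<>();
                        for (int i = 0; i < tokenIds.length; i++) {
                            vector.put(vocab[tokenIds[i]], values[i]);
                        }
                        return vector;
                    }
                }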

            IIUC, OpenSearch and Elasticsearch both support sparse
            embeddings:

            
            https://sbert.net/examples/sparse_encoder/applications/semantic_search/README.html#opensearch-integration
            
            https://sbert.net/examples/sparse_encoder/applications/semantic_search/README.html#elasticsearch-integration

            but are sparse embeddings also supported by Lucene itself?

            Thanks

            Michael






            





