Hi Adrien

The LIMIT-small dataset contains questions such as "Who likes Slide Rules?"
https://huggingface.co/datasets/orionweller/LIMIT-small/blob/main/queries.jsonl (query_1)
and the corpus entries contain exactly the words of the question, e.g. "Slide Rules":
https://huggingface.co/datasets/orionweller/LIMIT-small/blob/main/corpus.jsonl (see first and third entry)
Sparse embeddings and BM25 work very well for these cases, but as Orion Weller and his colleagues showed in their paper, dense embeddings do not work well on such a dataset.
But "typos" are quite common, and I was curious whether sparse embeddings can deal with them, for example

"Who likes Sleid Ruls?"

This is what I meant by a "not exact" query: it does not spell "Slide Rules" correctly.

I am a native German speaker, and "Sleid Ruls" pronounced the German way sounds very similar to "Slide Rules" pronounced the English way. It also looks similar from a character-statistics point of view.
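Just as a lexical point of comparison (this is not part of my prototype), Lucene's FuzzyQuery can tolerate such misspellings via Levenshtein edit distance. A minimal sketch, where the field name "text" and the class wrapper are just illustrative:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.Query;

public class FuzzyBaseline {

    // One fuzzy clause per query token. maxEdits=2 is the largest edit
    // distance FuzzyQuery supports, and "sleid" -> "slide" is exactly
    // 2 edits ("ruls" -> "rules" is 1).
    public static Query build(String field, String[] queryTokens) {
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        for (String token : queryTokens) {
            builder.add(new FuzzyQuery(new Term(field, token), 2),
                    BooleanClause.Occur.SHOULD);
        }
        return builder.build();
    }
}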
My current prototype implementation:

Indexing: https://github.com/wyona/katie-backend/blob/e86e6c5f0ab43cf2bc5d50ce461e9656a2c981a7/src/main/java/com/wyona/katie/handlers/LuceneVectorSearchQuestionAnswerImpl.java#L248
Searching: https://github.com/wyona/katie-backend/blob/e86e6c5f0ab43cf2bc5d50ce461e9656a2c981a7/src/main/java/com/wyona/katie/handlers/LuceneVectorSearchQuestionAnswerImpl.java#L522
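The gist of it follows Ben's FeatureField pointers below; roughly like this (a simplified sketch, the method names are illustrative, the real code is behind the links above):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.FeatureField;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

import java.util.Map;

public class SparseEmbeddingSketch {

    private static final String FIELD = "splade";

    // Indexing: one FeatureField per non-zero token of the sparse
    // vector, with the SPLADE weight as the feature value.
    public static Document buildDocument(Map<String, Float> tokenWeights) {
        Document doc = new Document();
        for (Map.Entry<String, Float> e : tokenWeights.entrySet()) {
            doc.add(new FeatureField(FIELD, e.getKey(), e.getValue()));
        }
        return doc;
    }

    // Searching: a boolean OR of linear feature queries, one per
    // non-zero token of the query vector, so that the score is the
    // dot product of query and document weights.
    public static Query buildQuery(Map<String, Float> tokenWeights) {
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        for (Map.Entry<String, Float> e : tokenWeights.entrySet()) {
            builder.add(FeatureField.newLinearQuery(FIELD, e.getKey(), e.getValue()),
                    BooleanClause.Occur.SHOULD);
        }
        return builder.build();
    }
}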
Please let me know if I am doing or understanding something wrong; any feedback is very much appreciated :-)
Thanks

Michael

On 28.01.26 at 14:02, Adrien Grand wrote:
Hi Michael,

What do you mean by "not exact" queries, and how do you map it onto Lucene?

On Tue, Jan 27, 2026 at 1:38 PM Michael Wechner <[email protected]> wrote:

I have implemented a first prototype, and for the dataset orionweller/LIMIT-small, using the sparse embedding model naver/splade-cocondenser-ensembledistil, I get recall@2=0.9035, which is quite good for "exact" queries, e.g. "Who likes Slide Rules?"

But for "not exact" queries, for example "Who likes Sleid Ruls?", I do not get good results compared with dense embeddings (model: all-mpnet-base-v2).

I will test some more, also using different models, but please let me know about your experiences using sparse embeddings.

Thanks

Michael

On 26.01.26 at 16:31, Michael Wechner wrote:

Hi Ben

Cool, thanks very much for these pointers, will try it asap :-)

I have recently implemented MTEB using Lucene and tested it on the LIMIT dataset
https://github.com/wyona/katie-backend/blob/284ef59ab70e19d95502f61b67bedc3cf7201a31/src/main/java/com/wyona/katie/services/BenchmarkService.java#L93
and was able to reproduce some of the results of "On the theoretical limitations of embedding-based retrieval"
https://arxiv.org/pdf/2508.21038
and I would be curious to see how well sparse embeddings work.

All the best

Michael

On 26.01.26 at 16:10, Benjamin Trent wrote:

Hey Michael,

Yeah, the Apache Lucene field type used by Elasticsearch is FeatureField:
https://lucene.apache.org/core/10_3_2/core/org/apache/lucene/document/FeatureField.html

To query, it's a boolean query of the non-zero components with the `linearQuery` option:
https://lucene.apache.org/core/10_3_2/core/org/apache/lucene/document/FeatureField.html#newLinearQuery(java.lang.String,java.lang.String,float)

Hope this helps!

Ben

On Mon, Jan 26, 2026 at 9:47 AM Michael Wechner <[email protected]> wrote:

Hi

I recently started to explore sparse embeddings using the sbert / sentence_transformers library
https://sbert.net/docs/sparse_encoder/usage/usage.html
where, for example, the sentence "He drove to the stadium" gets embedded as follows:

tensor(indices=tensor([[    0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
                             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
                             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
                             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
                             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
                             0,     0,     0,     0,     0,     0,     0,     0,     0],
                       [ 1996,  2000,  2001,  2002,  2010,  2018,  2032,  2056,  2180,  2209,
                         2253,  2277,  2288,  2299,  2343,  2346,  2359,  2365,  2374,  2380,
                         2441,  2482,  2563,  2688,  2724,  2778,  2782,  2958,  3116,  3230,
                         3298,  3309,  3346,  3478,  3598,  3942,  4019,  4062,  4164,  4306,
                         4316,  4322,  4439,  4536,  4716,  5006,  5225,  5439,  5533,  5581,
                         5823,  6891,  7281,  7467,  7921,  8514,  9065, 11037, 21028]]),
       values=tensor([0.2426, 1.2840, 0.4095, 1.3777, 0.6331, 0.7404, 0.2711, 0.3561, 0.0691,
                      0.0325, 0.1355, 0.3256, 0.0203, 0.7970, 0.0535, 0.1135, 0.0227, 0.0375,
                      0.8167, 0.5986, 0.3390, 0.2573, 0.1621, 0.2597, 0.2726, 0.0191, 0.0752,
                      0.0597, 0.2644, 0.7811, 1.4855, 0.0663, 2.8099, 0.4074, 0.0778, 1.0642,
                      0.1952, 0.7472, 0.7306, 0.1108, 0.5747, 1.5341, 1.9030, 0.2264, 0.0995,
                      0.3023, 1.1830, 0.1279, 0.7824, 0.4283, 0.0288, 0.3535, 0.1833, 0.0554,
                      0.2662, 0.0574, 0.4963, 0.2751, 0.0340]),
       device='mps:0', size=(1, 30522), nnz=59, layout=torch.sparse_coo)

The zeros just mean that all tokens belong to the first sentence, "He drove to the stadium", denoted by 0. Then the 59 relevant token ids (of the vocabulary of size 30522) are listed, and third come the importance weights for the relevant tokens.
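To index such a vector with Lucene, I would presumably map each token id back to its vocabulary term and keep the weight, e.g. (a rough sketch; it assumes the model's vocab.txt has been loaded into an array, one term per line, so that the line number equals the token id):

import java.util.LinkedHashMap;
import java.util.Map;

public class SparseVectorTerms {

    // Map each non-zero token id of the sparse vector back to its
    // WordPiece term; vocab is the model's 30522-entry vocabulary.
    public static Map<String, Float> toTermWeights(int[] tokenIds, float[] weights, String[] vocab) {
        Map<String, Float> termWeights = new LinkedHashMap<>();
        for (int i = 0; i < tokenIds.length; i++) {
            termWeights.put(vocab[tokenIds[i]], weights[i]);
        }
        return termWeights;
    }
}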
IIUC, OpenSearch and Elasticsearch both support sparse embeddings
https://sbert.net/examples/sparse_encoder/applications/semantic_search/README.html#opensearch-integration
https://sbert.net/examples/sparse_encoder/applications/semantic_search/README.html#elasticsearch-integration
but are sparse embeddings also supported by Lucene itself?

Thanks

Michael

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

--
Adrien
