Hi Michael,

What do you mean by "not exact" queries, and how do you map them onto Lucene?

On Tue, Jan 27, 2026 at 1:38 PM Michael Wechner <[email protected]> wrote:

> I have implemented a first prototype, and for the dataset
>
> orionweller/LIMIT-small
>
> using the sparse embedding model
>
> naver/splade-cocondenser-ensembledistil
>
> I get recall@2=0.9035, which is quite good for "exact" queries, e.g. "Who
> likes Slide Rules?"
>
> But for "not exact" queries, for example "Who likes Sleid Ruls?", I do
> not get good results compared with dense embeddings (model:
> all-mpnet-base-v2).
>
> I will test some more, also using different models, but please let me
> know about your experiences using sparse embeddings.
>
> Thanks
>
> Michael
>
>
> On 26.01.26 at 16:31, Michael Wechner wrote:
>
> Hi Ben
>
> Cool, thanks very much for these pointers, will try them asap :-)
>
> I have recently implemented MTEB using Lucene and tested it on the LIMIT
> dataset
>
> https://github.com/wyona/katie-backend/blob/284ef59ab70e19d95502f61b67bedc3cf7201a31/src/main/java/com/wyona/katie/services/BenchmarkService.java#L93
>
> and was able to reproduce some of the results of "On the theoretical
> limitations of embedding-based retrieval"
>
> https://arxiv.org/pdf/2508.21038
>
> and I would be curious to see how well sparse embeddings work.
>
> All the best
>
> Michael
>
>
> On 26.01.26 at 16:10, Benjamin Trent wrote:
>
> Hey Michael,
>
> Yeah, the Apache Lucene field type used by Elasticsearch is FeatureField:
> https://lucene.apache.org/core/10_3_2/core/org/apache/lucene/document/FeatureField.html
>
> To query, it's a boolean query over the non-zero components with the
> `linearQuery` option:
> https://lucene.apache.org/core/10_3_2/core/org/apache/lucene/document/FeatureField.html#newLinearQuery(java.lang.String,java.lang.String,float)
>
> Hope this helps!
>
> Ben
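
As an aside, here is a minimal, untested sketch of the indexing and query
pattern Ben describes, assuming Lucene 10.x. The field name "splade" is
arbitrary, and the token ids and weights are three of the 59 non-zero
components from the tensor quoted further down:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FeatureField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class SparseVectorDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();

        // Indexing: one FeatureField per non-zero component of the sparse
        // vector, with the token id as the feature name and the SPLADE
        // weight as the feature value (values must be positive and finite).
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StoredField("id", "doc-1"));
            int[] tokenIds = {2002, 3298, 3346};
            float[] weights = {1.3777f, 1.4855f, 2.8099f};
            for (int i = 0; i < tokenIds.length; i++) {
                doc.add(new FeatureField("splade", Integer.toString(tokenIds[i]), weights[i]));
            }
            writer.addDocument(doc);
        }

        // Querying: a disjunction of linear feature queries, one SHOULD
        // clause per non-zero component of the query vector; the resulting
        // score is the dot product of query and document weights over the
        // shared tokens.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            BooleanQuery.Builder builder = new BooleanQuery.Builder();
            builder.add(FeatureField.newLinearQuery("splade", "3346", 2.5f), BooleanClause.Occur.SHOULD);
            builder.add(FeatureField.newLinearQuery("splade", "3298", 1.2f), BooleanClause.Occur.SHOULD);
            Query query = builder.build();
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                // hit.doc is the internal Lucene doc id, not the stored "id".
                System.out.println("doc=" + hit.doc + " score=" + hit.score);
            }
        }
    }
}

IIRC FeatureField encodes the value into the term frequency with some
precision loss, so the dot product is approximate.
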
> On Mon, Jan 26, 2026 at 9:47 AM Michael Wechner <[email protected]>
> wrote:
>
>> Hi
>>
>> I recently started to explore sparse embeddings using the sbert /
>> sentence_transformers library
>>
>> https://sbert.net/docs/sparse_encoder/usage/usage.html
>>
>> where, for example, the sentence "He drove to the stadium" gets embedded
>> as follows:
>>
>> tensor(indices=tensor([[    0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
>>                             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
>>                             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
>>                             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
>>                             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
>>                             0,     0,     0,     0,     0,     0,     0,     0,     0],
>>                        [ 1996,  2000,  2001,  2002,  2010,  2018,  2032,  2056,  2180,  2209,
>>                          2253,  2277,  2288,  2299,  2343,  2346,  2359,  2365,  2374,  2380,
>>                          2441,  2482,  2563,  2688,  2724,  2778,  2782,  2958,  3116,  3230,
>>                          3298,  3309,  3346,  3478,  3598,  3942,  4019,  4062,  4164,  4306,
>>                          4316,  4322,  4439,  4536,  4716,  5006,  5225,  5439,  5533,  5581,
>>                          5823,  6891,  7281,  7467,  7921,  8514,  9065, 11037, 21028]]),
>>        values=tensor([0.2426, 1.2840, 0.4095, 1.3777, 0.6331, 0.7404, 0.2711, 0.3561,
>>                       0.0691, 0.0325, 0.1355, 0.3256, 0.0203, 0.7970, 0.0535, 0.1135,
>>                       0.0227, 0.0375, 0.8167, 0.5986, 0.3390, 0.2573, 0.1621, 0.2597,
>>                       0.2726, 0.0191, 0.0752, 0.0597, 0.2644, 0.7811, 1.4855, 0.0663,
>>                       2.8099, 0.4074, 0.0778, 1.0642, 0.1952, 0.7472, 0.7306, 0.1108,
>>                       0.5747, 1.5341, 1.9030, 0.2264, 0.0995, 0.3023, 1.1830, 0.1279,
>>                       0.7824, 0.4283, 0.0288, 0.3535, 0.1833, 0.0554, 0.2662, 0.0574,
>>                       0.4963, 0.2751, 0.0340]),
>>        device='mps:0', size=(1, 30522), nnz=59, layout=torch.sparse_coo)
>>
>> The zeros just mean that all tokens belong to the first sentence, "He
>> drove to the stadium", which is denoted by 0.
>>
>> Then the 59 relevant token ids (from the vocabulary of size 30522) are
>> listed, and finally the importance weights for those tokens.
>>
>> IIUC, OpenSearch and Elasticsearch both support sparse embeddings
>>
>> https://sbert.net/examples/sparse_encoder/applications/semantic_search/README.html#opensearch-integration
>>
>> https://sbert.net/examples/sparse_encoder/applications/semantic_search/README.html#elasticsearch-integration
>>
>> but are sparse embeddings also supported by Lucene itself?
>>
>> Thanks
>>
>> Michael

--
Adrien
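
P.S.: To make the tensor layout above concrete: the first row of indices is
the sentence (batch) index, the second row holds the token ids, and values
holds the matching weights, so the embedding boils down to 59 (token id,
weight) pairs. A rough sketch of that unpacking step, where buildFeatures is
a made-up helper name and only the first five of the 59 components are
spelled out:

import java.util.LinkedHashMap;
import java.util.Map;

public class SpladePairs {
    // Hypothetical helper: pair up the token-id row of the sparse_coo
    // indices with the values row, yielding feature name -> weight entries,
    // ready to become FeatureFields as in the sketch further up.
    static Map<String, Float> buildFeatures(int[] tokenIds, float[] weights) {
        Map<String, Float> features = new LinkedHashMap<>();
        for (int i = 0; i < tokenIds.length; i++) {
            features.put(Integer.toString(tokenIds[i]), weights[i]);
        }
        return features;
    }

    public static void main(String[] args) {
        // First five of the 59 non-zero components from the tensor above.
        int[] tokenIds = {1996, 2000, 2001, 2002, 2010};
        float[] weights = {0.2426f, 1.2840f, 0.4095f, 1.3777f, 0.6331f};
        buildFeatures(tokenIds, weights)
                .forEach((token, weight) -> System.out.println(token + " -> " + weight));
    }
}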
