Hey Michael,
Yeah, the Apache Lucene field type used by Elasticsearch
is FeatureField:
https://lucene.apache.org/core/10_3_2/core/org/apache/lucene/document/FeatureField.html
To query, you build a boolean query over the non-zero components,
with one clause per component created via `newLinearQuery`:
https://lucene.apache.org/core/10_3_2/core/org/apache/lucene/document/FeatureField.html#newLinearQuery(java.lang.String,java.lang.String,float)
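In case a sketch helps: conceptually, such a query scores a document by
summing query_weight * stored_weight over the terms that the query and the
document share. Here's a minimal plain-Python illustration of that linear
scoring model (just an illustration of the idea, not actual Lucene code;
the token ids and weights below are made up):

```python
# Sketch of the scoring that a boolean query of
# FeatureField.newLinearQuery clauses computes: each non-zero query
# component that also appears in the document contributes
# query_weight * stored_feature_value to the score, i.e. a dot
# product restricted to the shared non-zero components.
def linear_sparse_score(query_weights, doc_weights):
    """Dot product over the components present in both maps."""
    score = 0.0
    for term, q_w in query_weights.items():
        d_w = doc_weights.get(term)
        if d_w is not None:  # term present in the document's feature field
            score += q_w * d_w
    return score

# Hypothetical token-id -> weight maps, SPLADE-style:
query = {2000: 1.2840, 3116: 2.8099, 4019: 0.7472}
doc   = {2000: 0.5000, 3116: 1.0000, 9999: 0.3000}

# Only token ids 2000 and 3116 overlap:
# 1.2840 * 0.5 + 2.8099 * 1.0 = 3.4519
print(linear_sparse_score(query, doc))
```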
Hope this helps!
Ben
On Mon, Jan 26, 2026 at 9:47 AM Michael Wechner
<[email protected]> wrote:
Hi
I recently started to explore sparse embeddings using the sbert /
sentence_transformers library
https://sbert.net/docs/sparse_encoder/usage/usage.html
whereas for example the following sentence "He drove to the stadium"
gets embedded as follows:
tensor(indices=tensor([[   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
                           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
                           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
                           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
                           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
                           0,    0,    0,    0,    0,    0,    0,    0,    0],
                       [1996, 2000, 2001, 2002, 2010, 2018, 2032, 2056, 2180, 2209,
                        2253, 2277, 2288, 2299, 2343, 2346, 2359, 2365, 2374, 2380,
                        2441, 2482, 2563, 2688, 2724, 2778, 2782, 2958, 3116, 3230,
                        3298, 3309, 3346, 3478, 3598, 3942, 4019, 4062, 4164, 4306,
                        4316, 4322, 4439, 4536, 4716, 5006, 5225, 5439, 5533, 5581,
                        5823, 6891, 7281, 7467, 7921, 8514, 9065, 11037, 21028]]),
       values=tensor([0.2426, 1.2840, 0.4095, 1.3777, 0.6331, 0.7404, 0.2711,
                      0.3561, 0.0691, 0.0325, 0.1355, 0.3256, 0.0203, 0.7970,
                      0.0535, 0.1135, 0.0227, 0.0375, 0.8167, 0.5986, 0.3390,
                      0.2573, 0.1621, 0.2597, 0.2726, 0.0191, 0.0752, 0.0597,
                      0.2644, 0.7811, 1.4855, 0.0663, 2.8099, 0.4074, 0.0778,
                      1.0642, 0.1952, 0.7472, 0.7306, 0.1108, 0.5747, 1.5341,
                      1.9030, 0.2264, 0.0995, 0.3023, 1.1830, 0.1279, 0.7824,
                      0.4283, 0.0288, 0.3535, 0.1833, 0.0554, 0.2662, 0.0574,
                      0.4963, 0.2751, 0.0340]),
       device='mps:0', size=(1, 30522), nnz=59,
       layout=torch.sparse_coo)
The zeros in the first index row just mean that all tokens belong to the
first sentence "He drove to the stadium", denoted by 0.
The second index row lists the 59 relevant token ids (out of the
vocabulary of size 30522), and finally the values are the importance
weights of those relevant tokens.
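For what it's worth, getting from that COO output to something indexable
is just a matter of pairing the second index row with the values, giving
a token-id-to-weight map with 59 entries. A plain-Python sketch using the
first few components of the tensor above (no torch dependency):

```python
# Pair the vocabulary token ids (second row of the COO indices) with
# their weights to get the token_id -> weight map that a sparse /
# feature field would store. Subset of the tensor output above.
token_ids = [1996, 2000, 2001, 2002, 2010]
weights   = [0.2426, 1.2840, 0.4095, 1.3777, 0.6331]

sparse_embedding = dict(zip(token_ids, weights))

print(sparse_embedding[2002])  # 1.3777
```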
IIUC, OpenSearch and Elasticsearch both support sparse embeddings:
https://sbert.net/examples/sparse_encoder/applications/semantic_search/README.html#opensearch-integration
https://sbert.net/examples/sparse_encoder/applications/semantic_search/README.html#elasticsearch-integration
but are sparse embeddings also supported by Lucene itself?
Thanks
Michael
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]