Hey Michael,

Yeah, the Apache Lucene field type used by Elasticsearch is FeatureField:
https://lucene.apache.org/core/10_3_2/core/org/apache/lucene/document/FeatureField.html
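
You index one FeatureField per non-zero component, with the token id
as the feature name and the weight as the feature value. A minimal
sketch, using a few of the token id / weight pairs from your example
below (the field name "embedding" is just illustrative):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.FeatureField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    Directory dir = new ByteBuffersDirectory();
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
        Document doc = new Document();
        // One FeatureField per non-zero component: the token id (as a
        // string) is the feature name, the weight is the feature value.
        doc.add(new FeatureField("embedding", "2002", 1.3777f));
        doc.add(new FeatureField("embedding", "3298", 1.4855f));
        doc.add(new FeatureField("embedding", "3346", 2.8099f));
        // ... and so on for the remaining non-zero token/weight pairs
        writer.addDocument(doc);
    }

Note that the weight is stored with limited precision, since
FeatureField encodes it into the term frequency.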


To query, you build a boolean query over the non-zero components using
the `newLinearQuery` factory method:
https://lucene.apache.org/core/10_3_2/core/org/apache/lucene/document/FeatureField.html#newLinearQuery(java.lang.String,java.lang.String,float)
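
A sketch of the query side, with made-up query-side weights: each
linear clause scores weight * indexedValue, so summing the SHOULD
clauses approximates the dot product between the two sparse vectors.

    import org.apache.lucene.document.FeatureField;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;

    BooleanQuery.Builder builder = new BooleanQuery.Builder();
    // One SHOULD clause per non-zero component of the query embedding.
    builder.add(FeatureField.newLinearQuery("embedding", "2002", 1.2f),
            BooleanClause.Occur.SHOULD);
    builder.add(FeatureField.newLinearQuery("embedding", "3346", 2.4f),
            BooleanClause.Occur.SHOULD);
    Query query = builder.build();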

Hope this helps!

Ben

On Mon, Jan 26, 2026 at 9:47 AM Michael Wechner <[email protected]>
wrote:

> Hi
>
> I recently started to explore sparse embeddings using the sbert /
> sentence_transformers library
>
> https://sbert.net/docs/sparse_encoder/usage/usage.html
>
> where for example the following sentence "He drove to the stadium"
> gets embedded as follows:
>
> tensor(indices=tensor([[    0,     0,     0,     0,     0,     0,     0,     0,
>                              0,     0,     0,     0,     0,     0,     0,     0,
>                              0,     0,     0,     0,     0,     0,     0,     0,
>                              0,     0,     0,     0,     0,     0,     0,     0,
>                              0,     0,     0,     0,     0,     0,     0,     0,
>                              0,     0,     0,     0,     0,     0,     0,     0,
>                              0,     0,     0,     0,     0,     0,     0,     0,
>                              0,     0,     0],
>                         [ 1996,  2000,  2001,  2002,  2010,  2018,  2032,  2056,
>                           2180,  2209,  2253,  2277,  2288,  2299,  2343,  2346,
>                           2359,  2365,  2374,  2380,  2441,  2482,  2563,  2688,
>                           2724,  2778,  2782,  2958,  3116,  3230,  3298,  3309,
>                           3346,  3478,  3598,  3942,  4019,  4062,  4164,  4306,
>                           4316,  4322,  4439,  4536,  4716,  5006,  5225,  5439,
>                           5533,  5581,  5823,  6891,  7281,  7467,  7921,  8514,
>                           9065, 11037, 21028]]),
>         values=tensor([0.2426, 1.2840, 0.4095, 1.3777, 0.6331, 0.7404, 0.2711, 0.3561,
>                        0.0691, 0.0325, 0.1355, 0.3256, 0.0203, 0.7970, 0.0535, 0.1135,
>                        0.0227, 0.0375, 0.8167, 0.5986, 0.3390, 0.2573, 0.1621, 0.2597,
>                        0.2726, 0.0191, 0.0752, 0.0597, 0.2644, 0.7811, 1.4855, 0.0663,
>                        2.8099, 0.4074, 0.0778, 1.0642, 0.1952, 0.7472, 0.7306, 0.1108,
>                        0.5747, 1.5341, 1.9030, 0.2264, 0.0995, 0.3023, 1.1830, 0.1279,
>                        0.7824, 0.4283, 0.0288, 0.3535, 0.1833, 0.0554, 0.2662, 0.0574,
>                        0.4963, 0.2751, 0.0340]),
>         device='mps:0', size=(1, 30522), nnz=59, layout=torch.sparse_coo)
>
> The zeros just mean that all tokens belong to the first sentence "He
> drove to the stadium", denoted by 0.
>
> Then the 59 relevant token ids (out of the vocabulary of size 30522)
> are listed, and finally the importance weights for those tokens.
>
> IIUC OpenSearch and Elasticsearch both support sparse embeddings
>
>
> https://sbert.net/examples/sparse_encoder/applications/semantic_search/README.html#opensearch-integration
>
> https://sbert.net/examples/sparse_encoder/applications/semantic_search/README.html#elasticsearch-integration
>
> but are sparse embeddings also supported by Lucene itself?
>
> Thanks
>
> Michael
>