The BM25 similarity computes the normalized length as the number of tokens, ignoring synonyms (tokens at the same position).
Then it encodes this length as an 8-bit integer in the index using this logic:
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/SmallFloat.java#L147-L156,
which preserves slightly more than 4 significant bits. (A short sketch of both
steps follows below the quoted message.)

On Tue, Jul 13, 2021 at 1:22 PM Dwaipayan Roy <dwaipayan....@gmail.com> wrote:

> During indexing, an inverted index is built from the terms of the documents,
> and the term frequency, document frequency, etc. are stored. If I understand
> correctly, the exact document length is not stored in the index in order to
> reduce its size. Instead, a normalized length is stored for each document.
> However, for most retrieval functions, document length is a necessary
> component, and the normalized doc-length is used in those functions.
>
> I want to ask how exactly the normalization process is performed. The
> question might have been answered already, but I was unable to find the
> proper response. Your help is much appreciated.
>
> Thanks.

-- Adrien
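
A short, self-contained sketch of both steps, assuming lucene-core is on the
classpath; the class name and the token counts are made up for illustration,
while SmallFloat.intToByte4/byte4ToInt are the methods at the link above:

import org.apache.lucene.util.SmallFloat;

public class NormEncodingSketch {
  public static void main(String[] args) {
    // Step 1: the length is the token count minus tokens at the same position
    // (synonyms/overlaps), which BM25Similarity discounts by default.
    int numTokens = 12;   // hypothetical field stats
    int numOverlaps = 2;  // hypothetical synonym tokens (position increment 0)
    int length = numTokens - numOverlaps;

    // Step 2: the length is compressed into a single byte for the norms,
    // so at search time BM25 only sees an approximation of the true length.
    byte norm = SmallFloat.intToByte4(length);
    System.out.println(length + " -> byte " + norm + " -> " + SmallFloat.byte4ToInt(norm));

    // Larger lengths make the loss of precision easier to see.
    for (int l : new int[] {100, 999, 10_000, 1_000_000}) {
      System.out.println(l + " -> " + SmallFloat.byte4ToInt(SmallFloat.intToByte4(l)));
    }
  }
}

Printing both values shows the effect of the lossy encoding: larger lengths
come back as nearby approximations rather than exact values, since only the
top few significant bits survive the single byte stored per document.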