The BM25 similarity computes the normalized length as the number of tokens,
ignoring synonyms (tokens that share a position with another token).
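
For illustration, here is a minimal sketch of that rule (the numbers are
hypothetical; in Lucene the two counts come from FieldInvertState.getLength()
and FieldInvertState.getNumOverlap(), and overlapping tokens are discounted
by default):

  // Hypothetical field: 12 tokens were indexed, 2 of which are synonyms
  // that share a position with another token (position increment of 0).
  int totalTokens = 12;   // FieldInvertState.getLength()
  int overlapTokens = 2;  // FieldInvertState.getNumOverlap()

  // Length used by BM25 for normalization: synonyms are not counted.
  int lengthForNorms = totalTokens - overlapTokens; // 10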

Then it encodes this length as an 8-bit integer in the index using this
logic:
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/SmallFloat.java#L147-L156,
which preserves a bit more than 4 significant bits.
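
If you want to see the effect of that encoding, something like the following
should work with lucene-core on the classpath; the decoded values are only
illustrative, since precision depends on the rounding done by SmallFloat:

  import org.apache.lucene.util.SmallFloat;

  public class NormEncodingDemo {
    public static void main(String[] args) {
      // Round-trip a few document lengths through the one-byte encoding
      // that BM25 stores in the norms. Small lengths survive exactly;
      // larger ones are rounded because only ~4 mantissa bits are kept.
      for (int length : new int[] {3, 17, 100, 1000, 12345}) {
        byte norm = SmallFloat.intToByte4(length);
        int decoded = SmallFloat.byte4ToInt(norm);
        System.out.println(length + " -> " + decoded);
      }
    }
  }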

On Tue, Jul 13, 2021 at 1:22 PM Dwaipayan Roy <dwaipayan....@gmail.com>
wrote:

> During indexing, an inverted index is built from the terms of the documents,
> and statistics such as term frequency and document frequency are stored. If I
> understand correctly, the exact document length is not stored in the index in
> order to reduce its size; instead, a normalized length is stored for each
> document. However, for most retrieval functions, document length is a
> necessary component, and the normalized doc-length is used in those functions.
>
> I want to ask how exactly this normalization is performed. The question may
> have been answered already, but I was unable to find the proper response.
> Your help is much appreciated.
>
> Thanks.
>


-- 
Adrien
