The BM25 similarity computes the normalized length as the number of tokens, ignoring synonyms (tokens at the same position).
Then it encodes this length as an 8-bit integer in the index using this logic:
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/SmallFloat.java#L147-L156,
which preserves slightly more than 4 significant bits. (A short sketch of both
steps follows below the quoted message.)

On Tue, Jul 13, 2021 at 1:22 PM Dwaipayan Roy <dwaipayan....@gmail.com> wrote:

> During indexing, an inverted index is built from the terms of the documents,
> and the term frequency, document frequency, etc. are stored. If I understand
> correctly, the exact document length is not stored in the index in order to
> reduce its size. Instead, a normalized length is stored for each document.
> However, for most retrieval functions, document length is a necessary
> component, and the normalized doc-length is used in those functions.
>
> I want to ask how exactly the normalization process is performed. The
> question might have been answered already, but I was unable to find the
> proper response. Your help is much appreciated.
>
> Thanks.

-- Adrien
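
A short, self-contained sketch of both steps, assuming lucene-core is on the
classpath; the class name and the token counts are made up for illustration,
while SmallFloat.intToByte4/byte4ToInt are the methods at the link above:

import org.apache.lucene.util.SmallFloat;

public class NormEncodingSketch {
  public static void main(String[] args) {
    // Step 1: the length is the token count minus tokens at the same position
    // (synonyms/overlaps), which BM25Similarity discounts by default.
    int numTokens = 12;   // hypothetical field stats
    int numOverlaps = 2;  // hypothetical synonym tokens (position increment 0)
    int length = numTokens - numOverlaps;

    // Step 2: the length is compressed into a single byte for the norms,
    // so at search time BM25 only sees an approximation of the true length.
    byte norm = SmallFloat.intToByte4(length);
    System.out.println(length + " -> byte " + norm + " -> " + SmallFloat.byte4ToInt(norm));

    // Larger lengths make the loss of precision easier to see.
    for (int l : new int[] {100, 999, 10_000, 1_000_000}) {
      System.out.println(l + " -> " + SmallFloat.byte4ToInt(SmallFloat.intToByte4(l)));
    }
  }
}

Printing both values shows the effect of the lossy encoding: larger lengths
come back as nearby approximations rather than exact values, since only the
top few significant bits survive the single byte stored per document.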