Hello, Koo.

I can share my shallow understanding of this algorithm.
The approximate field length is called the norm. It is stored in a single
byte per document for the sake of compactness (and therefore performance).
Norms are encoded via org.apache.lucene.util.SmallFloat#intToByte4.
Thus, there can be only 256 distinct values of the approximate field
length. They are decoded by SmallFloat.byte4ToInt((byte) i) and cached
as floats in BM25Similarity#LENGTH_TABLE.
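
To make the rounding concrete, here is a small standalone sketch of how I
understand the byte encoding to work. It is not the Lucene source: the class
NormSketch and its method names are made up for illustration, and I assume a
non-negative length. The idea is that small lengths are stored exactly, and
larger ones keep only their 4 most significant bits plus a shift. In real
code you would simply call SmallFloat.intToByte4 and SmallFloat.byte4ToInt.

```java
// Illustrative sketch only, not Lucene code. Assumes length >= 0.
class NormSketch {
    // Largest code the 4-significant-bit scheme needs for Integer.MAX_VALUE.
    static final int MAX_INT4 = longToInt4(Integer.MAX_VALUE);
    // Codes left over in one byte; lengths below this are stored exactly.
    static final int NUM_FREE_VALUES = 255 - MAX_INT4;

    // Keep the top 4 significant bits of i and remember how far they were shifted.
    static int longToInt4(long i) {
        int numBits = 64 - Long.numberOfLeadingZeros(i);
        if (numBits < 4) {
            return (int) i; // tiny values are kept as they are
        }
        int shift = numBits - 4;
        int encoded = (int) (i >>> shift) & 0x07; // leading 1 bit is implicit
        return encoded | ((shift + 1) << 3);      // shift + 1, since 0 means "tiny"
    }

    // Inverse of longToInt4; the bits dropped by the shift come back as zeros.
    static long int4ToLong(int i) {
        long bits = i & 0x07;
        int shift = (i >>> 3) - 1;
        return shift == -1 ? bits : (bits | 0x08) << shift;
    }

    static byte encode(int length) {
        if (length < NUM_FREE_VALUES) {
            return (byte) length;
        }
        return (byte) (NUM_FREE_VALUES + longToInt4(length - NUM_FREE_VALUES));
    }

    static int decode(byte b) {
        int i = Byte.toUnsignedInt(b);
        if (i < NUM_FREE_VALUES) {
            return i;
        }
        return (int) (NUM_FREE_VALUES + int4ToLong(i - NUM_FREE_VALUES));
    }

    // What a field length looks like after a round trip through one byte.
    static int approximate(int length) {
        return decode(encode(length));
    }

    public static void main(String[] args) {
        System.out.println(approximate(78)); // prints 76
    }
}
```

With this sketch, a length of 78 comes back as 76, which matches the value
reported in the question.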
Float lengths might seem unexpected, but if you trace the usages of
LENGTH_TABLE you'll see that these values are only used to compute ratios
(floats), so it's fine to use floats as inputs to the score calculation.
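
As a rough illustration of why only the ratio matters, here is a sketch of
the textbook BM25 term-frequency factor, freq / (freq + k1 * (1 - b + b * dl
/ avgdl)). The class Bm25TfSketch and the average length of 100 are
hypothetical, chosen only for the example; k1 = 1.2 and b = 0.75 are the
usual defaults. It is not a copy of what BM25Similarity does internally.

```java
// Illustrative sketch only: textbook BM25 tf factor with float inputs.
class Bm25TfSketch {
    // dl enters the formula only through the ratio dl / avgdl.
    static float tf(float freq, float dl, float avgdl, float k1, float b) {
        return freq / (freq + k1 * (1 - b + b * dl / avgdl));
    }

    public static void main(String[] args) {
        float k1 = 1.2f;     // usual default
        float b = 0.75f;     // usual default
        float avgdl = 100f;  // hypothetical average field length

        // Exact length 78 versus the approximate length 76.
        System.out.println(tf(1f, 78f, avgdl, k1, b));
        System.out.println(tf(1f, 76f, avgdl, k1, b));
    }
}
```

Under these assumed numbers the two results differ only slightly, so the
lost precision in dl has a small effect on the score.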
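As far as I understand, because the stored length is an approximation, a
field with 78 tokens can be reported as 76, and what is preserved is the
ordering:
    field_length(dl) <= field_length(dl+1)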
WDYT?

On Wed, Aug 9, 2023 at 5:43 PM 承諾一輩子 <502565...@qq.com.invalid> wrote:

> Dear colleague:
>    I am a drive development engineer working at ZTE Corporation in
> China.
>    Recently I have been studying the Lucene source code. There is a
> question that has puzzled me for a long time, as follows:
>    How should I understand the approximate handling mechanism for the
> field length "dl" in the BM25Scorer class? For example, the "keywords"
> field has 78 tokens. I think its field_length(dl) is 78, but Lucene
> treats it as 76 (approximate), as described in the function
> explainTF(Explanation freq, long norm).
>    Thank you very much for reading; I look forward to your
> answer!
>
>
> Koo
> Drive development engineer



-- 
Sincerely yours
Mikhail Khludnev
