Hello, Koo.

I can share my shallow understanding of this algorithm. The approximate field length is called the norm, and it is stored in a single byte per document for the sake of compactness (and therefore performance). Norms are encoded via org.apache.lucene.util.SmallFloat#intToByte4. As a result there can be at most 256 distinct values of approximate field length, so several neighbouring lengths collapse to the same stored value. That is why your 78-token field is reported as 76.

The 256 values are decoded by SmallFloat.byte4ToInt((byte) i) and cached as floats in BM25Similarity#LENGTH_TABLE. Float lengths might seem unexpected, but if you trace the usage of LENGTH_TABLE you'll see that these values are only used to compute ratios (floats), so it is fine to use floats as inputs for the score calculation.

As far as I understand, the approximation is lossy but order-preserving: field_length(dl) <= field_length(dl+1).

WDYT?
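To make the rounding concrete, here is a minimal standalone sketch of how I read the byte4 scheme. It is a re-implementation for illustration, not a copy of the Lucene source: the class name NormSketch and the method names encode, decode and lengthTable are hypothetical, and the details may differ between Lucene versions. In real code you would simply call SmallFloat.intToByte4 and SmallFloat.byte4ToInt.

```java
// Standalone sketch of the one-byte norm encoding. NormSketch is a hypothetical
// name; the logic reflects my reading of SmallFloat and is not copied from it.
final class NormSketch {

    // Lengths below this threshold are stored exactly (assumption of this sketch).
    static final int NUM_FREE_VALUES = 255 - longToInt4(Integer.MAX_VALUE);

    // Keep only the 4 most significant bits of i, plus a shift (a tiny float format).
    static int longToInt4(long i) {
        if (i < 0) throw new IllegalArgumentException("negative value: " + i);
        int numBits = 64 - Long.numberOfLeadingZeros(i);
        if (numBits < 4) {
            return (int) i;                 // small enough to store as-is
        }
        int shift = numBits - 4;
        int encoded = (int) (i >>> shift);  // top 4 bits of the value
        encoded &= 0x07;                    // drop the implicit leading 1 bit
        encoded |= (shift + 1) << 3;        // store shift + 1; 0 means "as-is"
        return encoded;
    }

    // Inverse of longToInt4, up to the low bits that were thrown away.
    static long int4ToLong(int i) {
        long bits = i & 0x07;
        int shift = (i >>> 3) - 1;
        return shift == -1 ? bits : (bits | 0x08) << shift;
    }

    // Field length -> one byte (lossy for larger lengths).
    static byte encode(int length) {
        if (length < 0) throw new IllegalArgumentException("negative length: " + length);
        if (length < NUM_FREE_VALUES) return (byte) length;
        return (byte) (NUM_FREE_VALUES + longToInt4(length - NUM_FREE_VALUES));
    }

    // One byte -> approximate field length.
    static int decode(byte b) {
        int i = Byte.toUnsignedInt(b);
        if (i < NUM_FREE_VALUES) return i;
        return (int) (NUM_FREE_VALUES + int4ToLong(i - NUM_FREE_VALUES));
    }

    // All 256 decodable lengths as floats, in the spirit of LENGTH_TABLE.
    static float[] lengthTable() {
        float[] table = new float[256];
        for (int i = 0; i < 256; i++) {
            table[i] = decode((byte) i);
        }
        return table;
    }
}
```

With this sketch, NormSketch.decode(NormSketch.encode(78)) gives 76, which matches what Koo saw in the explain output. Small lengths round-trip unchanged, and a longer field never decodes to a smaller value than a shorter one.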
On Wed, Aug 9, 2023 at 5:43 PM 承諾一輩子 <502565...@qq.com.invalid> wrote:

> Dear colleague:
>
> I am a drive development engineer working at ZTE Corporation in China.
> Recently I have been studying the Lucene source code. There is a question
> that has puzzled me for a long time:
>
> How should I understand the approximate handling mechanism for the field
> length "dl" in the BM25Scorer class? For example, the "keywords" field has
> 78 tokens. I think its field_length(dl) is 78, but Lucene treats it as
> 76 (approximate), as described in the function
> explainTF(Explanation freq, long norm).
>
> Thank you very much for reading, and I look forward to your answer!
>
> Koo
> Drive development engineer

--
Sincerely yours
Mikhail Khludnev