Re: [I] Bypass total frequency check if field uses custom term frequency [LUCENE-10048] [lucene]

via GitHub Fri, 16 Jan 2026 08:30:12 -0800


msokolov commented on issue #11086:
URL: https://github.com/apache/lucene/issues/11086#issuecomment-3760846479


   I want to come back to this one again. I spent some time implementing an 
encoding for custom term frequencies: a floating-point term frequency score, 
basically a 12-bit encoding that represents only nonnegative integers and 
whose exponent is always nonnegative, the idea being that the encoded bits of 
the maximum value look like a reasonably bounded integer. While we can get 
this to work, it has some drawbacks. The main one is that the loss of 
precision becomes noticeable. Possibly not in a truly meaningful way, but it 
does produce different search results for some queries. We could increase the 
precision, but then the encoding gets larger and we run into overflow issues 
again. For example, if our maximum encoded value uses 16 bits, i.e. it is 
65535 viewed as an int, then we can only have 64K of these terms in a field 
before we overflow. 
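
   To make the overflow budget concrete, here is a rough sketch of the 
arithmetic. It is not Lucene code: the class and method names are made up for 
illustration, and it assumes the per-field total is accumulated in a 32-bit 
integer. With a signed int the budget for 16-bit maximal values comes out 
nearer 32K terms; the 64K figure corresponds to treating the total as unsigned.

```java
// Illustrative only: FieldLengthBudget is a hypothetical helper, not a Lucene class.
final class FieldLengthBudget {

    /** How many terms of the given value fit in a signed 32-bit running total. */
    static int maxTermsSigned(int perTermValue) {
        return Integer.MAX_VALUE / perTermValue;
    }

    /** The same budget if the running total were treated as unsigned 32-bit. */
    static long maxTermsUnsigned(int perTermValue) {
        return 0xFFFFFFFFL / perTermValue;
    }

    /** Sums termCount copies of perTermValue with overflow checking. */
    static boolean fits(int perTermValue, int termCount) {
        int total = 0;
        try {
            for (int i = 0; i < termCount; i++) {
                total = Math.addExact(total, perTermValue);
            }
            return true;
        } catch (ArithmeticException overflow) {
            return false;
        }
    }

    public static void main(String[] args) {
        int max16 = 65535; // largest value representable in 16 bits
        System.out.println("signed budget:   " + maxTermsSigned(max16));
        System.out.println("unsigned budget: " + maxTermsUnsigned(max16));
    }
}
```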
   
   The other challenge is that a custom floating-point encoding introduces 
some slowdown: we have to decode the values, we read a lot of them per query 
per document, and we can't rely on any intrinsics for our 12-bit encoding.
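
   For readers who want a picture of what such an encoding involves, below is 
a minimal sketch of one possible 12-bit layout (4-bit exponent, 8-bit 
mantissa). The bit split and the `TwelveBitFreq` name are assumptions made for 
this example, not the encoding actually implemented. It shows both the 
rounding (nearby values collapse to the same code) and the shift/mask work 
needed on every decode.

```java
// Hypothetical layout: [4-bit exponent][8-bit mantissa]; not the real implementation.
final class TwelveBitFreq {

    /** Largest value this layout can represent without saturating (2^23 - 1). */
    static final int MAX_VALUE = (1 << 23) - 1;

    /** Encodes a nonnegative int into 12 bits, dropping low-order bits as needed. */
    static int encode(int value) {
        if (value < 0 || value > MAX_VALUE) {
            throw new IllegalArgumentException("out of range: " + value);
        }
        if (value < 256) {
            return value; // exponent 0: small values are stored exactly
        }
        int highBit = 31 - Integer.numberOfLeadingZeros(value); // position of top set bit, >= 8
        int shift = highBit - 8;                 // number of low bits discarded
        int exponent = shift + 1;                // 1..15
        int mantissa = (value >>> shift) & 0xFF; // 8 bits below the implicit leading 1
        return (exponent << 8) | mantissa;
    }

    /** Decodes a 12-bit code back to an int (a lower bound of the original value). */
    static int decode(int code) {
        int exponent = code >>> 8;
        int mantissa = code & 0xFF;
        if (exponent == 0) {
            return mantissa;
        }
        return (0x100 | mantissa) << (exponent - 1); // restore the implicit leading 1
    }
}
```

   In this layout 1000 and 1001 share a code and both decode to 1000, which is 
the kind of precision loss that can reorder closely scored hits.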
   
   Instead we are using a much simpler solution: positively identify fields 
with custom term frequencies (rather than trying to infer this from norms 
being disabled or something similar, as was tried in a previous patch), and 
then use this knowledge to avoid the overflow in DefaultIndexingChain by 
treating each term occurrence as a frequency of 1. The encoded "term 
frequency" written to the term dict/postings is then considered a term "score" 
that has nothing to do with frequency of occurrence, which is really the 
spirit of the original custom term freq feature.
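
   Roughly, the idea looks like the sketch below. This is a standalone 
illustration, not the patch itself: `FieldLengthAccumulator` and its 
`customTermFreq` flag are hypothetical names standing in for however the field 
ends up being marked, and the real logic lives in DefaultIndexingChain.

```java
// Hypothetical stand-in for the per-field length bookkeeping done while indexing.
final class FieldLengthAccumulator {

    private final boolean customTermFreq; // assumed explicit per-field marker
    private int length;                   // running field length

    FieldLengthAccumulator(boolean customTermFreq) {
        this.customTermFreq = customTermFreq;
    }

    /** Records one term occurrence whose term frequency attribute is termFreq. */
    void addTerm(int termFreq) {
        // For custom-frequency fields the value is a score, so it counts as one occurrence.
        int increment = customTermFreq ? 1 : termFreq;
        length = Math.addExact(length, increment); // throws ArithmeticException on overflow
    }

    int length() {
        return length;
    }
}
```

   With the flag set, the field length is just the number of terms, so it can 
no longer overflow because of large per-term scores; without it, two terms 
with very large values are enough to trip the overflow check.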
   
   What I want to understand is what the practical consequences would be if 
someone were to, say, compute the default Similarity over such a field (we 
don't plan to do that, but someone could, and it shouldn't break in a horrible 
way).  Anyway I'll post a small patch that will probably explain what I'm 
talking about better than this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

