msokolov commented on issue #11086: URL: https://github.com/apache/lucene/issues/11086#issuecomment-3760846479
I want to come back to this one again. I spent some time implementing an encoding for custom term frequencies that uses a floating-point term frequency score: basically a 12-bit encoding that represents only nonnegative integers and whose exponent is always nonnegative, the idea being that the encoded bits of the maximum value look like a reasonably bounded integer. While we can get this to work, it has some drawbacks.

The main one is that the loss of precision becomes noticeable. Possibly not in a truly meaningful way, but it does produce different search results for some queries. We could increase the precision, but then the encoding gets larger and we run into overflow issues again. For example, if our maximum encoded value uses 16 bits, i.e. it is 65535 viewed as an int, then we can only have 64K of these terms in a field before we overflow.

The other challenge is that a custom floating-point encoding introduces some slowdown: we have to decode the values, we read a lot of these values per query per document, and we can't rely on any intrinsics for our 12-bit encoding.

Instead we are using a much simpler solution: positively identify fields with custom term frequencies (rather than trying to infer this from having norms disabled or something similar, as was tried in a previous patch), and then use this knowledge to avoid the overflow in DefaultIndexingChain by treating each term occurrence as a frequency of 1. The encoded "term frequency" written to the term dict/postings is then considered a term "score" that has nothing to do with frequency of occurrence, which is really the spirit of the original custom term freq feature.

What I want to understand is what the practical consequences would be if someone were to, say, compute the default Similarity over such a field (we don't plan to do that, but someone could, and it shouldn't break in a horrible way).
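To make the overflow arithmetic above concrete, here is a minimal Java sketch. It is not Lucene code: the class and method names are hypothetical, and it assumes the per-field length is accumulated in a signed 32-bit int with overflow-checked addition. Under that assumption the limit for a maximum value of 65535 works out to 32768 occurrences, the same order of magnitude as the 64K figure quoted above (which would correspond to an unsigned 32-bit sum).

```java
// Hypothetical illustration only; none of these names exist in Lucene.
final class TermFreqOverflowSketch {

    // Largest number of occurrences, each carrying maxFreq, whose sum
    // still fits in a signed 32-bit int (assumed accumulator type).
    static int maxOccurrencesBeforeOverflow(int maxFreq) {
        return Integer.MAX_VALUE / maxFreq;
    }

    // Sums `count` copies of `freq` with overflow-checked addition and
    // reports whether the running total overflowed.
    static boolean overflows(int freq, int count) {
        int length = 0;
        try {
            for (int i = 0; i < count; i++) {
                length = Math.addExact(length, freq);
            }
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    // The simpler approach described above: each occurrence contributes 1
    // to the field length, regardless of the custom value stored with it.
    static int lengthCountingEachOccurrenceAsOne(int count) {
        return count;
    }

    public static void main(String[] args) {
        int maxFreq = 65535; // 16-bit maximum encoded value
        int limit = maxOccurrencesBeforeOverflow(maxFreq);
        System.out.println("occurrences that fit: " + limit);
        System.out.println("one more overflows: " + overflows(maxFreq, limit + 1));
        System.out.println("length when counting 1 per occurrence: "
            + lengthCountingEachOccurrenceAsOne(limit + 1));
    }
}
```

Counting each occurrence as 1 keeps the accumulated length equal to the number of occurrences, so it grows no faster than it would for an ordinary field, whatever custom values are stored.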
Anyway I'll post a small patch that will probably explain what I'm talking about better than this.
