On Mon, Nov 13, 2017 at 8:14 PM, Chris Hostetter <[email protected]> wrote:
>
> I'm not very familiar with exactly what code is run by each of these
> benchmarks, but is it possible the Similarity changes in LUCENE-7997 had
> an impact? IIUC some stats/calculations were changed from floats to
> doubles ... could that change account for this?
>

It may be the case. The problem we found there is that the previous BM25 did not obey the monotonicity requirements needed for score-based optimizations such as LUCENE-4100 and LUCENE-7993. These algorithms can greatly speed up our slowest queries (disjunctions and phrases), but they are only correct if the similarity is well-behaved in this way.

In the BM25 case, scores would decrease in some situations with very high TF values because of floating-point issues, e.g. score(freq=100,000) would be unexpectedly less than score(freq=99,999), all other things being equal.

There may be other ways to rearrange the code to avoid this problem. Feel free to open an issue if you can optimize the code better while still behaving properly!
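
To make the monotonicity concern concrete, here is a minimal Java sketch. It is not Lucene's BM25Similarity code: the class name, the constants K1 and NORM, and both formulas are assumptions chosen for illustration. It compares the textbook term-frequency formula, where freq appears in both numerator and denominator and each is rounded separately, with one possible rearrangement where freq appears only once. Since each individual float operation rounds monotonically, the second form cannot decrease as freq grows. The first form has no such guarantee, and whether it actually misbehaves depends on the parameters and the freq range.

```java
/** Hypothetical sketch for illustration; not Lucene's BM25Similarity. */
final class Bm25Monotonicity {

    /** Maps a term frequency to a score, all other things being equal. */
    interface TfScorer {
        float score(int freq);
    }

    // Assumed illustrative values, not taken from Lucene.
    static final float K1 = 1.2f;
    // Stands in for k1 * (1 - b + b * docLen / avgDocLen).
    static final float NORM = 0.9f;

    /** Textbook form: freq is used twice, so rounding errors can interact. */
    static float naive(int freq) {
        return freq * (K1 + 1f) / (freq + NORM);
    }

    /** Algebraically equal form in which freq is used exactly once. */
    static float rearranged(int freq) {
        float weight = K1 + 1f;
        return weight - weight / (1f + freq / NORM);
    }

    /** Counts frequencies in [2, maxFreq] whose score is lower than the score at freq - 1. */
    static int countViolations(TfScorer scorer, int maxFreq) {
        int violations = 0;
        float previous = scorer.score(1);
        for (int freq = 2; freq <= maxFreq; freq++) {
            float current = scorer.score(freq);
            if (current < previous) {
                violations++;
            }
            previous = current;
        }
        return violations;
    }

    public static void main(String[] args) {
        int maxFreq = 1_000_000;
        System.out.println("naive violations:      "
                + countViolations(Bm25Monotonicity::naive, maxFreq));
        System.out.println("rearranged violations: "
                + countViolations(Bm25Monotonicity::rearranged, maxFreq));
    }
}
```

A check like countViolations could serve as a regression test for any proposed rearrangement: run it over a range of frequencies and norm values and require zero violations.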
