On Mon, Nov 13, 2017 at 8:14 PM, Chris Hostetter
<[email protected]> wrote:
>
> I'm not very familiar with exactly what code is run by each of these
> benchmarks, but is it possible the Similarity changes in LUCENE-7997 had
> an impact?  IIUC some stats/calculations were changed from floats to
> doubles ... could that change account for this?
>

It may be the case: the problem we found there is that the previous
BM25 did not obey the monotonicity requirements needed for score-based
optimizations such as LUCENE-4100 and LUCENE-7993. These algorithms
can greatly speed up our slowest queries (disjunctions and phrase
queries), but they need the similarity to be well-behaved in this way
in order to be correct.

In the BM25 case, scores would decrease in some situations with very
high TF values because of floating point issues, e.g.
score(freq=100,000) would be unexpectedly less than
score(freq=99,999), all other things being equal. There may be other
ways to re-arrange the code to avoid this problem; feel free to open
an issue if you can optimize the code better while still behaving
properly!
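
To make the monotonicity point concrete, here is a rough, standalone
sketch (not the actual BM25Similarity code) that scans a range of
frequencies and counts how often a float-only TF score drops when freq
goes up. The class and method names, the k1 value and the norm value
are all made up for illustration. The "rearranged" variant is just one
possible re-arrangement: each step is a single correctly rounded float
operation that is monotonic in its input, so the composition cannot
decrease. Whether the textbook form actually shows violations depends
on the norm and the range you scan.

```java
// Hypothetical sketch for illustration only; not Lucene's BM25Similarity.
final class TfMonotonicityCheck {
    // Assumed k1 parameter; any positive value works for this check.
    static final float K1 = 1.2f;

    // Minimal functional interface so both variants can be scanned the same way.
    interface TfScore {
        float score(float freq, float norm);
    }

    // Textbook form, all in float: (k1 + 1) * freq / (freq + norm).
    // Numerator and denominator are rounded separately, so the rounded
    // quotient is not guaranteed to be non-decreasing in freq.
    static float naive(float freq, float norm) {
        return (K1 + 1f) * freq / (freq + norm);
    }

    // Algebraically equivalent form: w - w / (1 + freq / norm).
    // freq / norm, 1 + x, w / y and w - z are each monotonic after
    // rounding, so the result is non-decreasing in freq (norm > 0).
    static float rearranged(float freq, float norm) {
        float weight = K1 + 1f;
        return weight - weight / (1f + freq / norm);
    }

    // Counts the freq values in [2, maxFreq] whose score is strictly
    // lower than the score of freq - 1.
    static int countViolations(TfScore f, float norm, int maxFreq) {
        int violations = 0;
        float prev = f.score(1f, norm);
        for (int freq = 2; freq <= maxFreq; freq++) {
            float cur = f.score((float) freq, norm);
            if (cur < prev) {
                violations++;
            }
            prev = cur;
        }
        return violations;
    }

    public static void main(String[] args) {
        float norm = 0.9f;      // assumed length-normalization factor
        int maxFreq = 200_000;  // covers the freq=100,000 example
        System.out.println("naive violations:      "
            + countViolations(TfMonotonicityCheck::naive, norm, maxFreq));
        System.out.println("rearranged violations: "
            + countViolations(TfMonotonicityCheck::rearranged, norm, maxFreq));
    }
}
```

A scan like this only tests monotonicity in freq for one fixed norm;
the score-based optimizations presumably also care about how the score
behaves as the norm changes, which would need a second loop.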

