[
https://issues.apache.org/jira/browse/LUCENE-7347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15340522#comment-15340522
]
Michael McCandless commented on LUCENE-7347:
--------------------------------------------
bq. but I don't see the relationship between term saturation and coord.
The "problem" with TF/IDF is that if a single term out of the N terms in your
boolean query occurs many times in a document, it drastically increases the
score because its term saturation is "weak": {{sqrt(termFreq)}}. If the query
is {{x OR y}} and a document has 1000 x's and 0 y's, TF/IDF gives it a great
score, even though y never occurred. Coord tries to counteract that behavior
by scaling the score by the fraction of query terms that actually matched.
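
To make the arithmetic concrete, here is a minimal Python sketch (not Lucene code) of the {{x OR y}} case. It assumes idf and all norms are 1 so that only term frequency matters, and it models coord as matched terms divided by total query terms; the names {{tf_classic}} and {{classic_score}} are made up for illustration.

```python
import math

def tf_classic(freq):
    # Classic TF-IDF term-frequency component: sqrt(termFreq).
    return math.sqrt(freq)

def classic_score(freqs, use_coord):
    # freqs maps each query term to its frequency in the document.
    # Assumption: idf and length norms are 1, so they are omitted.
    matched = sum(1 for f in freqs.values() if f > 0)
    raw = sum(tf_classic(f) for f in freqs.values() if f > 0)
    # Assumption: coord is the fraction of query terms that matched.
    coord = matched / len(freqs) if use_coord else 1.0
    return raw * coord

doc_a = {"x": 1000, "y": 0}  # many x's, no y

print(classic_score(doc_a, use_coord=False))  # sqrt(1000), about 31.6
print(classic_score(doc_a, use_coord=True))   # halved by coord, about 15.8
```

Under these assumptions coord halves the score of the document that matched only one of the two terms, but the score still grows without bound as the frequency of x grows.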
BM25, by contrast, has much stronger term saturation, controlled by its {{k1}}
parameter, such that a single term in your query occurring many times does not
increase the score nearly as much as another term going from freq 0 to freq 1.
BM25 therefore naturally favors documents that have at least one occurrence of
more of the requested query terms. So a document with only, say, 5 x's and 1 y
will naturally get a better score than the first document with 1000 x's and
0 y's.
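
A rough Python sketch of that comparison (again not Lucene code). It assumes {{k1 = 1.2}}, idf of 1 for both terms, and documents of average length so that the length-normalization factor drops out; the names {{tf_bm25}} and {{bm25_score}} are made up for illustration.

```python
K1 = 1.2  # assumed value of the k1 saturation parameter

def tf_bm25(freq, k1=K1):
    # BM25 term-frequency component with length normalization ignored
    # (assumption: document length equals the average length).
    # It approaches k1 + 1 as freq grows, and is 0 when freq is 0.
    return freq * (k1 + 1) / (freq + k1)

def bm25_score(freqs, k1=K1):
    # Assumption: idf is 1 for every term, so it is omitted.
    return sum(tf_bm25(f, k1) for f in freqs.values())

doc_a = {"x": 1000, "y": 0}  # many x's, no y
doc_b = {"x": 5, "y": 1}     # a few x's and one y

print(bm25_score(doc_a))  # about 2.20, just under the k1 + 1 ceiling
print(bm25_score(doc_b))  # about 2.77, higher because y matched
```

With these assumptions, going from 5 to 1000 occurrences of x adds about 0.42 to the score, while y going from 0 to 1 occurrence adds 1.0, so the document matching both terms wins without any coord factor.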
> Remove queryNorm and coords
> ---------------------------
>
> Key: LUCENE-7347
> URL: https://issues.apache.org/jira/browse/LUCENE-7347
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
>
> These two features are specific to TF-IDF and introduce some complexity (see
> eg. handling of coords in BooleanWeight) and bugs/corner-cases (see eg. how
> taking the query norm into account causes scoring challenges on LUCENE-7337).
> Since we made BM25 the default in 6.0, I propose that we remove these
> TF-IDF-specific features in 7.0.