[ https://issues.apache.org/jira/browse/LUCENE-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669843#action_12669843 ]
Mike Klaas commented on LUCENE-1534:
------------------------------------
[quote]But if we feel that over-emphasizes terms with large idfs, then we
should not remove an idf factor from one vector, but rather rework our weight
heuristic, perhaps replacing idf with sqrt(idf), no?[/quote]
FWIW, when we implemented web search on a large (500m) corpus, we found the
stock idf factor in Lucene to be too high, and ended up sqrt()'ing it in
Similarity.
That said, much of the score in this system came from anchor text, link
analysis scores, and term proximity, so it's hard to measure the impact of the
idf change independently.
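
For anyone curious what sqrt()'ing the idf does numerically, here is a minimal
standalone sketch. It does not depend on Lucene: stockIdf() copies the formula
used by DefaultSimilarity.idf(int docFreq, int numDocs) in the 2.x line, and
the class and method names (SqrtIdfSketch, stockIdf, sqrtIdf) are made up for
illustration. It is not the code from the system described above.

```java
// Standalone illustration; no Lucene classes are required.
// All names here are hypothetical and chosen only for this sketch.
final class SqrtIdfSketch {

    // Same formula as Lucene 2.x DefaultSimilarity.idf(docFreq, numDocs):
    // log(numDocs / (docFreq + 1)) + 1
    static float stockIdf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Dampened variant: the square root of the stock idf, which
    // compresses the gap between rare and common terms.
    static float sqrtIdf(int docFreq, int numDocs) {
        return (float) Math.sqrt(stockIdf(docFreq, numDocs));
    }
}
```

In a real index the same change would presumably go into a DefaultSimilarity
subclass that overrides idf(int, int), installed with
IndexWriter.setSimilarity() and Searcher.setSimilarity(). Since the idf enters
the score through both the query weight and the per-document term weight,
dampening it in Similarity affects both factors.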
> idf(t) is not actually squared during scoring?
> ----------------------------------------------
>
> Key: LUCENE-1534
> URL: https://issues.apache.org/jira/browse/LUCENE-1534
> Project: Lucene - Java
> Issue Type: Bug
> Components: Query/Scoring
> Affects Versions: 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 2.9
>
>
> The javadocs for Similarity:
>
> http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html
> show idf(t) as being squared when computing net query score. But I
> don't think it is actually squared, in looking at the sources? Maybe
> it used to be, eg this interesting discussion:
> http://markmail.org/message/k5pl7scmiac5wosb
> Or am I missing something? We just need to fix the javadocs to take
> away the "squared"...