[jira] Commented: (LUCENE-1908) Similarity javadocs for scoring function to relate more tightly to scoring models in effect

Doron Cohen (JIRA) Sat, 12 Sep 2009 10:49:20 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754571#action_12754571 ]


Doron Cohen commented on LUCENE-1908:
-------------------------------------

Mark and Shai, thanks for reviewing!

Mark, I think you have a point here (and I am definitely no more an IR guy than 
you are :)).

Truth is, I was surprised to find out (through your comments in LUCENE-1896) 
that this component of the score is "missing", and I indeed thought that the 
"right thing to do" (if there is such a thing as "right") really is to do both: 
normalize to the unit vector, and then normalize by length to compensate for 
the "unfair" advantage of long documents. 

But you're right: the way I presented V(d) normalization and doc-length 
normalization is incorrect. It reads as if doing both is the right thing, and 
that presentation does not do justice to Lucene. I will change the writing. 

Interestingly, for a document containing N distinct terms, the 1/Euclidean-norm 
and Lucene's default similarity's length norm are the same: 1/sqrt(N). But if 
you double that doc to have two occurrences of each of the N distinct terms, 
its length would be 2N, 1/Euclidean-norm would be 1/sqrt(4N) but Lucene's 
default similarity's length norm would be 1/sqrt(2N). So Lucene punishes 
documents with duplicate terms less than the Euclidean norm would...  
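
A minimal sketch of the arithmetic above, not Lucene code. It assumes the 
document vector is weighted by raw term counts only (no idf, no boosts), which 
is what the 1/sqrt(4N) figure implies, and it models the default length norm as 
1/sqrt(number of terms in the field). The function names and sample terms are 
made up for illustration.

```python
import math
from collections import Counter


def euclidean_norm_factor(tokens):
    """1 / Euclidean norm of the raw term-count vector (assumed weighting)."""
    counts = Counter(tokens)
    return 1.0 / math.sqrt(sum(c * c for c in counts.values()))


def length_norm_factor(tokens):
    """1 / sqrt(total number of terms), modelling the default length norm."""
    return 1.0 / math.sqrt(len(tokens))


n = 4  # number of distinct terms, chosen arbitrarily
single = [f"t{i}" for i in range(n)]  # each distinct term occurs once
doubled = single * 2                  # each distinct term occurs twice

for name, doc in (("single", single), ("doubled", doubled)):
    print(name, euclidean_norm_factor(doc), length_norm_factor(doc))
```

For the single document both factors equal 1/sqrt(N). For the doubled document 
the Euclidean factor is 1/sqrt(4N) while the length-norm factor is 1/sqrt(2N), 
so the length norm is the milder penalty.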

I am not aware of an evaluation or discussion of why this approach was 
selected, so I assumed (tentatively) that it was merely for performance 
considerations. You said in LUCENE-1896:
bq. not just similar properties - but many times better properties - the 
standard normalization would not factor in document length at all - it 
essentially removes it.
Is it really better? It seems to "punish" the same for length due to distinct 
terms, and to punish less for length due to duplicate terms. Is this really a 
desired behavior? My intuition says no, but I am not sure.

Anyhow, this issue is more about describing what Lucene is doing today than 
about what Lucene should do, so I think I have the correct picture now (except 
for the historical justification, which is interesting but not a show stopper).

Shai, thanks for the fixes. 

(updated patch to follow).

> Similarity javadocs for scoring function to relate more tightly to scoring 
> models in effect
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1908
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1908
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Doron Cohen
>            Assignee: Doron Cohen
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1908.patch
>
>
> See discussion in the related issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
