[
https://issues.apache.org/jira/browse/LUCENE-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754571#action_12754571
]
Doron Cohen commented on LUCENE-1908:
-------------------------------------
Mark and Shai, thanks for reviewing!
Mark, I think you have a point here (and I am definitely no more an IR guy than
you are :)).
The truth is I was surprised to find out (through your comments in LUCENE-1896)
that this component of the score is "missing", and I did think that the
"right thing to do" (if there is such a thing as "right") really is to do both:
normalize to the unit vector, and then normalize by length to compensate for
the "unfair" advantage of long documents.
But you're right: the way I presented V(d) normalization and doc-length
normalization is incorrect. It reads as if doing both is the right thing to do,
which does not do justice to Lucene. I will change the writing.
Interestingly, for a document containing N distinct terms, the 1/Euclidean-norm
and Lucene's default similarity's length norm are the same: 1/sqrt(N). But if
you double that doc to have two occurrences of each of the N distinct terms,
its length would be 2N, 1/Euclidean-norm would be 1/sqrt(4N) but Lucene's
default similarity's length norm would be 1/sqrt(2N). So we will punish
documents with duplicate terms less than would the Euclidean norm...
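
The arithmetic above can be checked with a small sketch. This is not Lucene
code: the two helper functions are hypothetical names, term weights are assumed
to be raw term frequencies (no idf, no boosts), and the lossy single-byte
encoding of stored norms is ignored.

```python
import math
from collections import Counter


def inverse_euclidean_norm(tokens):
    """1 / Euclidean norm of the raw term-frequency vector (assumed weighting)."""
    counts = Counter(tokens)
    return 1.0 / math.sqrt(sum(c * c for c in counts.values()))


def default_length_norm(tokens):
    """1 / sqrt(number of tokens), the length norm described in the text."""
    return 1.0 / math.sqrt(len(tokens))


n = 4
single = [f"t{i}" for i in range(n)]  # N distinct terms, one occurrence each
doubled = single * 2                  # same N terms, two occurrences each

for name, doc in (("single", single), ("doubled", doubled)):
    print(name, inverse_euclidean_norm(doc), default_length_norm(doc))
```

For the single document both values are 1/sqrt(N). For the doubled document
the first is 1/sqrt(4N) and the second is 1/sqrt(2N), so under these
assumptions the length norm penalizes the repeated terms less.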
I am not aware of any evaluation or discussion of why this approach was
selected, so I assumed (though I am not certain) that it was merely for
performance considerations. You said in LUCENE-1896:
bq. not just similar properties - but many times better properties - the
standard normalization would not factor in document length at all - it
essentially removes it.
Is it really better? It seems to "punish" the same for length due to distinct
terms, and to punish less for length due to duplicate terms. Is this really a
desired behavior? My intuition says no, but I am not sure.
Anyhow, this issue is more about describing what Lucene is doing today than
about what Lucene should do, so I think I have the correct picture now (except
for the historical justification, which is interesting but not a show stopper).
Shai, thanks for the fixes.
(updated patch to follow).
> Similarity javadocs for scoring function to relate more tightly to scoring
> models in effect
> -------------------------------------------------------------------------------------------
>
> Key: LUCENE-1908
> URL: https://issues.apache.org/jira/browse/LUCENE-1908
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Reporter: Doron Cohen
> Assignee: Doron Cohen
> Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1908.patch
>
>
> See discussion in the related issue.