On Aug 29, 2008, at 7:53 AM, Sébastien Rainville wrote:
I'm curious... what do you mean by "It's not perfect (there is no such
thing) but it works pretty well in most cases, and works great if you
spend a little time figuring out the right length normalization
factors."? Can you please elaborate a little more on what the length
normalization factors are exactly and what makes them good or bad...
it's a part of Lucene that is really confusing me, as I'm still a
newbie :P
If you're a newbie, it's probably best not to go there just yet, but
since you asked...
Lucene and many other search systems adjust scores based on how long
documents are, the theory being that a shorter document containing the
relevant terms is more interesting than, or at least as interesting
as, a longer document in which the terms are repeated a ton of times.
It essentially acts as a counterweight to long documents with high
term frequency values. But, like pretty much everything in relevance
tuning, as Erik says, "it depends". It depends on things like your
queries, your docs, etc. You (and by you, I mean your users) may
actually prefer longer documents, or you may find that Lucene favors
short documents too much. Thus, you may want to override lengthNorm()
in the Similarity class. The key, of course, is to tread into this
only after you have a working system and after you have established
that you are, indeed, not happy with a _large_ number of results. At
that point you need to do a methodical study of what the queries are
and what the "right" results are, and then explore alternatives (even
doing things like A/B testing), of which modifying the length
normalization may be one.
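To make that concrete, here's a rough sketch of what the hook looks
like with the 2.x-era API (the fourth-root curve below is made up
purely for illustration, not a recommendation; pick your own curve
based on your testing):

  import org.apache.lucene.search.DefaultSimilarity;

  // Sketch only: dampen the length penalty relative to the
  // default 1/sqrt(numTerms).
  public class GentlerLengthNormSimilarity extends DefaultSimilarity {
    public float lengthNorm(String fieldName, int numTerms) {
      // DefaultSimilarity returns (float) (1.0 / Math.sqrt(numTerms));
      // a fourth root flattens the curve so long docs are penalized less.
      return (float) (1.0 / Math.sqrt(Math.sqrt(numTerms)));
    }
  }

You install it on both sides with IndexWriter.setSimilarity() and
Searcher.setSimilarity(), and keep in mind that lengthNorm() is
applied at index time (it's baked into the norms), so you have to
re-index before a change shows up in your scores.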
At a lower level, some people feel that a lengthNorm() of
1/sqrt(numTerms) is not the right default, but I don't know that
anyone has definitively said what a better default is. It works pretty
well for most people out of the box, which is why I made the comment
about it probably not being best to go there just yet. My gut says it
is a value that Doug came up with way back when he was doing a lot of
empirical testing and felt it was best, and it really hasn't been
modified since, but that is just a guess on my part; I haven't looked
at the revision history of it.
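Just to give a feel for the default curve (back-of-the-envelope
numbers, and remember each norm gets squeezed into a single byte in
the index, so fine distinctions get lost anyway):

  1/sqrt(4)     = 0.5    (4-term field)
  1/sqrt(100)   = 0.1    (100-term field)
  1/sqrt(10000) = 0.01   (10,000-term field)

So a match in a 4-term field gets roughly a 50x multiplier relative to
the same match in a 10,000-term field, before term frequency pushes
back the other way.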
You may find Doron's wiki entry informative:
http://wiki.apache.org/lucene-java/TREC_2007_Million_Queries_Track_-_IBM_Haifa_Team
You also might find my talk at ApacheCon 07 helpful in general:
http://people.apache.org/~gsingers/apachecon07/LucenePerformance.ppt
starting at slide 23 or so, where I talk about relevance.
Otherwise, dig into the archives at lucene.markmail.org and look up
length normalization or relevance tuning, Similarity, etc.
HTH,
Grant