On Aug 29, 2008, at 7:53 AM, Sébastien Rainville wrote:

I'm curious... what do you mean by "It's not perfect (there is no such
thing) but it works pretty well in most cases, and works great if you spend a little time figuring out the right length normalization factors."? Could you please elaborate a little more on what the length normalization factors are exactly and what makes them good or bad? It's a part of Lucene that is
really confusing me, as I'm still a newbie :P



If you're a newbie, it's probably best not to go there just yet, but, since you asked...

Lucene and many search systems adjust scores based on how long documents are, the theory being that a shorter document containing the relevant terms is at least as interesting as a longer document that repeats those terms a ton of times. Length normalization essentially acts as a counterweight to long documents with high term frequency values.

But, like pretty much everything in relevance tuning, as Erik says, "it depends". It depends on things like your queries, your docs, etc. You (and by you, I mean your users) may actually prefer longer documents, or you may find that Lucene favors short documents too much. Thus, one may want to override lengthNorm() in the Similarity class.

The key, of course, is to tread into this only after you have a working system and after you have established that you are, indeed, not happy with a _large_ number of results. At that point you need to do a methodical study of what the queries are and what the "right" results are, and then explore alternatives (even doing things like A/B testing), of which length normalization modification may be one.

At a lower level, some people feel that a lengthNorm() of 1/sqrt(numTerms) is not the right default, but I don't know that anyone has definitively said what a better default is. It works pretty well for most people out of the box, which is why I made the comment about it probably not being best to go there just yet. My gut says it is a value that Doug came up with way back when he was doing a lot of empirical testing and felt it was best, and it really hasn't been modified since; but that is just a guess on my part, as I haven't looked at the revision history.
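To make the idea concrete, here is a minimal, self-contained sketch of what overriding length normalization looks like. The base class below is a stand-in for Lucene's DefaultSimilarity (in Lucene versions of this era the actual hook is lengthNorm(String fieldName, int numTerms) on Similarity; check the javadoc for your version). The pivot value and class names are purely hypothetical, chosen to illustrate one way of flattening the short-document bias:

```java
// Stand-in for Lucene's DefaultSimilarity; the real override point is
// Similarity.lengthNorm(String, int) -- verify the signature in your version.
class DefaultSimilaritySketch {
    // Lucene's out-of-the-box default: 1 / sqrt(numTerms)
    public float lengthNorm(String fieldName, int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }
}

// Hypothetical alternative: no penalty up to a pivot length, and a
// penalty relative to the pivot beyond it, so moderately long documents
// are not crushed by the default curve.
class MilderLengthNormSimilarity extends DefaultSimilaritySketch {
    private static final int PIVOT = 500; // made-up value; tune empirically

    @Override
    public float lengthNorm(String fieldName, int numTerms) {
        if (numTerms <= PIVOT) {
            return 1.0f;
        }
        return (float) Math.sqrt((double) PIVOT / numTerms);
    }
}

public class LengthNormDemo {
    public static void main(String[] args) {
        DefaultSimilaritySketch def = new DefaultSimilaritySketch();
        DefaultSimilaritySketch mild = new MilderLengthNormSimilarity();
        for (int n : new int[] {10, 500, 5000}) {
            System.out.printf("numTerms=%d default=%.4f milder=%.4f%n",
                    n, def.lengthNorm("body", n), mild.lengthNorm("body", n));
        }
    }
}
```

In a real index you would plug the custom class in via the setSimilarity() hooks on the writer and searcher, and you have to re-index for the new norms to take effect, since norms are baked in at index time.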

You may find Doron's wiki entry informative: 
http://wiki.apache.org/lucene-java/TREC_2007_Million_Queries_Track_-_IBM_Haifa_Team

You also might find my talk at ApacheCon 07 helpful in general: http://people.apache.org/~gsingers/apachecon07/LucenePerformance.ppt, starting at slide 23 or so, where I talk about relevance.

Otherwise, dig into the archives at lucene.markmail.org and look up length normalization or relevance tuning, Similarity, etc.

HTH,
Grant
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]