On Aug 29, 2008, at 7:53 AM, Sébastien Rainville wrote:

I'm curious... what do you mean by "It's not perfect (there is no such
thing) but it works pretty well in most cases, and works great if you spend a little time figuring out the right length normalization factors."? Could you please elaborate a little more on what the length normalization factors are exactly and what makes them good or bad? It's a part of Lucene that is
really confusing me, as I'm still a newbie :P



If you're a newbie, it's probably best not to go there just yet, but, since you asked...

Lucene and many search systems adjust scores based on how long documents are, the theory being that a shorter document containing the relevant terms is at least as interesting as a longer document that repeats those terms a ton of times. Length normalization essentially acts as a counterweight to long documents with high term frequency values.

But, like pretty much everything in relevance tuning, as Erik says, "it depends". It depends on things like your queries, your docs, etc. You (and by you, I mean your users) may actually prefer longer documents, or you may find that Lucene favors short documents too much. Thus, one may want to override lengthNorm() in the Similarity class.

The key, of course, is to tread into this only after you have a working system and after you have established that you are, indeed, not happy with a _large_ number of results. At that point you need to do a methodical study of what the queries are and what the "right" results are, and then explore alternatives (even doing things like A/B testing), of which length normalization modification may be one.

At a lower level, some people feel that a lengthNorm() of 1/sqrt(numTerms) is not the right default, but I don't know that anyone has definitively said what a better default is. It works pretty well for most people out of the box, which is why I made the comment about it probably not being best to go there just yet. My gut says it is a value that Doug came up with way back when he was doing a lot of empirical testing and felt it was best, and it really hasn't been modified since; but that is just a guess on my part, as I haven't looked at the revision history.
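To make the idea concrete, here is a minimal, self-contained sketch of what overriding length normalization looks like. The base class below is a stand-in for Lucene's DefaultSimilarity (in Lucene versions of this era the actual hook is lengthNorm(String fieldName, int numTerms) on Similarity; check the javadoc for your version). The pivot value and class names are purely hypothetical, chosen to illustrate one way of flattening the short-document bias:

```java
// Stand-in for Lucene's DefaultSimilarity; the real override point is
// Similarity.lengthNorm(String, int) -- verify the signature in your version.
class DefaultSimilaritySketch {
    // Lucene's out-of-the-box default: 1 / sqrt(numTerms)
    public float lengthNorm(String fieldName, int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }
}

// Hypothetical alternative: no penalty up to a pivot length, and a
// penalty relative to the pivot beyond it, so moderately long documents
// are not crushed by the default curve.
class MilderLengthNormSimilarity extends DefaultSimilaritySketch {
    private static final int PIVOT = 500; // made-up value; tune empirically

    @Override
    public float lengthNorm(String fieldName, int numTerms) {
        if (numTerms <= PIVOT) {
            return 1.0f;
        }
        return (float) Math.sqrt((double) PIVOT / numTerms);
    }
}

public class LengthNormDemo {
    public static void main(String[] args) {
        DefaultSimilaritySketch def = new DefaultSimilaritySketch();
        DefaultSimilaritySketch mild = new MilderLengthNormSimilarity();
        for (int n : new int[] {10, 500, 5000}) {
            System.out.printf("numTerms=%d default=%.4f milder=%.4f%n",
                    n, def.lengthNorm("body", n), mild.lengthNorm("body", n));
        }
    }
}
```

In a real index you would plug the custom class in via the setSimilarity() hooks on the writer and searcher, and you have to re-index for the new norms to take effect, since norms are baked in at index time.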

You may find Doron's wiki entry informative: 
http://wiki.apache.org/lucene-java/TREC_2007_Million_Queries_Track_-_IBM_Haifa_Team

You also might find my talk at ApacheCon 07 helpful in general: http://people.apache.org/~gsingers/apachecon07/LucenePerformance.ppt, starting at slide 23 or so, where I talk about relevance.

Otherwise, dig into the archives at lucene.markmail.org and look up length normalization or relevance tuning, Similarity, etc.

HTH,
Grant
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]