SweetSpotSimiliarity

Marvin Humphrey Tue, 23 May 2006 21:55:43 -0700

[note: subject header changed from "Re: [jira] Updated: (LUCENE-577)SweetSpotSimiliarity"]


Thought-provoking stuff, Hoss...


On May 23, 2006, at 5:55 PM, Hoss Man (JIRA) wrote:

This is a new Similarity implementation for the contrib/miscellaneous/ package. It provides a Similarity designed for people who know the "sweetspot" of their data. Three major pieces of functionality are included:
1) a lengthNorm which creates a "plateau" of values.

Presumably you had this in the can, and didn't just implement it today. :) For those of you who didn't see this afternoon's thread "Per-Field Analyzer" on java-user, KinoSearch has used a plateau lengthNorm since version 0.06...


   1 / sqrt(max(100, numTerms))

... and it's been a mixed bag.
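
For anyone who wants to see the shape of that curve, here is a minimal sketch in Python (not KinoSearch or Lucene code; the function names are made up for illustration). It compares the plateau formula above against a plain 1/sqrt(numTerms) curve:

```python
import math

PLATEAU = 100  # threshold taken from the formula above


def classic_length_norm(num_terms: int) -> float:
    """Plain 1/sqrt(numTerms), shown only for comparison."""
    return 1.0 / math.sqrt(num_terms)


def plateau_length_norm(num_terms: int, plateau: int = PLATEAU) -> float:
    """1/sqrt(max(plateau, numTerms)): flat up to `plateau` terms, then decaying."""
    return 1.0 / math.sqrt(max(plateau, num_terms))


if __name__ == "__main__":
    for n in (3, 50, 100, 400):
        print(n, round(classic_length_norm(n), 4), round(plateau_length_norm(n), 4))
```

Every field with 100 terms or fewer gets the same norm, so a 3-term stub no longer outscores a 50-term field on length alone.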

The suggestion came my way via Mark Bennett, apparently from Doug originally, though I didn't see that thread. Earlier discussion at http://xrl.us/mpkp (Link to mail-archives.apache.org). Mark's nifty graph is still up (linked from his email).

Making that algo the default achieved my goal: downgrade the type of "stub" documents Lucene tends to favor. However, it also stopped excellent matches in fields which are supposed to be short -- like title -- from getting a good solid lift.

The only answer seems to be to apply different lengthNorm algos to different fields.
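
A rough sketch of what per-field selection might look like, in Python rather than real Lucene or KinoSearch code. The field names, the lookup table, and the choice of which field gets which curve are all assumptions for illustration:

```python
import math


def plateau_norm(num_terms: int, plateau: int = 100) -> float:
    """Flat norm for fields up to `plateau` terms, 1/sqrt(n) beyond that."""
    return 1.0 / math.sqrt(max(plateau, num_terms))


def plain_norm(num_terms: int) -> float:
    """Plain 1/sqrt(n); guards against an empty field."""
    return 1.0 / math.sqrt(max(1, num_terms))


# Hypothetical mapping: short fields keep the steep curve, long ones get the plateau.
FIELD_NORMS = {
    "title": plain_norm,
    "body": plateau_norm,
}


def length_norm(field: str, num_terms: int) -> float:
    """Pick the norm function by field name, falling back to the plain curve."""
    return FIELD_NORMS.get(field, plain_norm)(num_terms)


if __name__ == "__main__":
    print(length_norm("title", 4))  # short title still gets a strong lift
    print(length_norm("body", 4))   # stub body is held at the plateau value
```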


What uses have you found for a plateau lengthNorm, Hoss?

2) a baseline tf that provides a fixed value for tf's up to a minimum, at which point it becomes a sqrt curve (this is used by the tf(int) function).
3) a hyperbolic tf function which is best explained by graphing the equation. this isn't used by default, but is available for subclasses to call from their own tf functions.


... and when do you use these custom tf's?
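
To make sure I'm reading item 2 right, here is one way a "fixed value up to a minimum, then a sqrt curve" could be written. This is a guess at the shape in Python, not the code from the patch: the parameter names, the defaults, and the choice to shift the sqrt so the curve is continuous at the minimum are all my assumptions.

```python
import math


def baseline_tf(freq: float, base: float = 1.0, minimum: float = 5.0) -> float:
    """Flat `base` for 0 < freq <= minimum, then a shifted sqrt curve.

    The shift (freq - minimum + base**2) is an assumption chosen so that the
    curve meets the flat segment at freq == minimum without a jump.
    """
    if freq <= 0:
        return 0.0
    if freq <= minimum:
        return base
    return math.sqrt(freq - minimum + base * base)


if __name__ == "__main__":
    for f in (0, 1, 5, 6, 8, 30):
        print(f, round(baseline_tf(f), 4))
```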

I tried to graph the hyperbolic function (tip for OS X users: check out Grapher.app, in Utilities). It looks like by default, everything cancels out and it returns a constant 2. But it's pretty complicated, so maybe I missed something.

My interest in this is being driven by a really savvy client with a formal mathematics background and a good feel for search engine design though no formal IR training. Today, he wrote, "The title is not a discussion. It's binary; this is being considered or it isn't. The more words that are being considered, the less significant any one is, but you can't get more considered by being mentioned more than once in the title."

I think I would implement this by having tf always return 1 for the title field.
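
Roughly like this, sketched in Python. The field-aware tf signature is hypothetical, and sqrt(freq) for the other fields is just a stand-in for whatever tf curve is otherwise in use:

```python
import math


def tf(field: str, freq: int) -> float:
    """Binary tf for the title field, sqrt(freq) elsewhere (illustrative only)."""
    if freq <= 0:
        return 0.0
    if field == "title":
        # Mentioned once or five times, the term is simply "being considered".
        return 1.0
    return math.sqrt(freq)


if __name__ == "__main__":
    print(tf("title", 1), tf("title", 3))  # same score regardless of repeats
    print(tf("body", 1), tf("body", 4))    # repeats still count in the body
```

The "more words, less significant" half of his remark would still be handled by lengthNorm; only the repeat-mention lift goes away.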

Thought: It would be really handy if we had a benchmarking test for IR precision.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


