[note: subject header changed from "Re: [jira] Updated: (LUCENE-577) SweetSpotSimiliarity"]

Thought-provoking stuff, Hoss...

On May 23, 2006, at 5:55 PM, Hoss Man (JIRA) wrote:

This is a new Similarity implementation for the contrib/miscellaneous/ package. It provides a Similarity designed for people who know the "sweetspot" of their data. Three major pieces of functionality are included:
1) a lengthNorm which creates a "plateau" of values.

Presumably you had this in the can, and didn't just implement it today. :) For those of you who didn't see this afternoon's thread "Per-Field Analyzer" on java-user, KinoSearch has used a plateau lengthNorm since version 0.06...

   1 / sqrt(max(100, numTerms))

... and it's been a mixed bag.

The suggestion came my way via Mark Bennett, apparently from Doug originally, though I didn't see that thread. Earlier discussion at http://xrl.us/mpkp (Link to mail-archives.apache.org). Mark's nifty graph is still up (linked from his email).

Making that algo the default achieved my goal: downgrade the type of "stub" documents Lucene tends to favor. However, it also stopped excellent matches in fields which are supposed to be short -- like title -- from getting a good solid lift.

The only answer seems to be to apply different lengthNorm algos to different fields.
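As a rough sketch of what per-field plateau norms might look like: the class name, the field names, and the per-field plateau widths below are made up for illustration and are not taken from KinoSearch or from the LUCENE-577 patch. Only the default of 100 comes from the formula quoted above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: a plateau lengthNorm with a per-field plateau width. */
public class PerFieldPlateauNorm {
    // Default taken from the 1 / sqrt(max(100, numTerms)) formula in this thread.
    private static final int DEFAULT_PLATEAU = 100;

    // Field name -> minimum term count used in the norm (illustrative values only).
    private final Map<String, Integer> plateaus = new HashMap<>();

    public void setPlateau(String field, int minTerms) {
        if (minTerms < 1) {
            throw new IllegalArgumentException("minTerms must be >= 1");
        }
        plateaus.put(field, minTerms);
    }

    /** 1 / sqrt(max(plateau, numTerms)): flat up to the plateau, decaying after it. */
    public float lengthNorm(String field, int numTerms) {
        int plateau = plateaus.getOrDefault(field, DEFAULT_PLATEAU);
        return (float) (1.0 / Math.sqrt(Math.max(plateau, numTerms)));
    }
}
```

With a plateau of 1 configured for a "title" field, a four-term title gets a norm of 0.5, while a four-term "body" field stays on the default plateau at 0.1. That restores the lift for short titles without rewarding stub documents elsewhere.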

What uses have you found for a plateau lengthNorm, Hoss?

2) a baseline tf that provides a fixed value for tf's up to a minimum, at which point it becomes a sqrt curve (this is used by the tf(int) function).
3) a hyperbolic tf function which is best explained by graphing the equation. this isn't used by default, but is available for subclasses to call from their own tf functions.

... and when do you use these custom tf's?
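For readers trying to picture the baseline tf described in item 2, here is a minimal sketch based only on that one-sentence description. The class name, the parameter names, the offset inside the sqrt (picked so the curve meets the flat part at freq == min), and the choice to return 0 for a zero frequency are all assumptions; the actual patch may differ.

```java
/** Hypothetical sketch of a "baseline" tf: flat up to a minimum, then a sqrt curve. */
public class BaselineTf {
    /**
     * @param freq raw term frequency in the document
     * @param base fixed value returned for frequencies up to min
     * @param min  frequency at which the sqrt curve takes over
     */
    public static float baselineTf(float freq, float base, float min) {
        if (freq == 0.0f) {
            return 0.0f; // assumption: an absent term contributes nothing
        }
        if (freq <= min) {
            return base; // the flat part
        }
        // Assumed offset: at freq == min this equals sqrt(base^2) == base,
        // so the two pieces join without a jump.
        return (float) Math.sqrt(freq + base * base - min);
    }
}
```

With base = 2 and min = 5, frequencies 1 through 5 all score 2, and a frequency of 10 scores sqrt(10 + 4 - 5) = 3.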

I tried to graph the hyperbolic function (tip for OS X users: check out Grapher.app, in Utilities). It looks like, by default, everything cancels out and it returns a constant 2. But it's pretty complicated, so maybe I missed something.

My interest in this is being driven by a really savvy client with a formal mathematics background and a good feel for search engine design though no formal IR training. Today, he wrote, "The title is not a discussion. It's binary; this is being considered or it isn't. The more words that are being considered, the less significant any one is, but you can't get more considered by being mentioned more than once in the title."

I think I would implement this by having tf always return 1 for the title field.
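A minimal sketch of that idea, assuming a tf function that is told which field it is scoring. The class name, the field-aware signature, and the "title" field name are hypothetical; the sqrt fallback for other fields mirrors the usual default tf.

```java
/** Hypothetical sketch: tf that treats presence in the title as binary. */
public class FieldAwareTf {
    public static float tf(String field, float freq) {
        if (freq <= 0.0f) {
            return 0.0f; // term not present in this field
        }
        if ("title".equals(field)) {
            return 1.0f; // mentioned once or five times, it counts the same
        }
        return (float) Math.sqrt(freq); // conventional sqrt damping elsewhere
    }
}
```

The "more words, less significant" half of the client's rule is left to lengthNorm; this function only removes the bonus for repeating a term in the title.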

Thought: It would be really handy if we had a benchmarking test for IR precision.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

