RE: Similarity coord,lengthNorm

Chuck Williams Mon, 07 Feb 2005 08:09:30 -0800

Hi Michael,

I'd suggest first using the explain() mechanism to figure out what's
going on.  Besides lengthNorm(), another factor that in my experience
is likely skewing your results is idf(), which Lucene typically makes
very large by squaring the intrinsic value.  I've found it helpful to
flatten lengthNorm(), tf() and idf() relative to what is used in
DefaultSimilarity.  There is a comparative evaluation of Similarity
implementations going on now.  You might consider looking at these:


Bug 32674 has a WikipediaSimilarity posted that you might want to try.
You might want to flatten lengthNorm() even further (e.g. all the way to
1.0), but I'd suggest trying it as is first.  If you try it, please post
your assessment.  Here's the link:
http://issues.apache.org/bugzilla/show_bug.cgi?id=32674
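
As a rough illustration of what "flattening" means here, the sketch
below reproduces the lengthNorm() and idf() formulas of
DefaultSimilarity in plain Java (no Lucene dependency) next to two
flatter variants.  The flat* methods and the class name are
hypothetical, chosen only for illustration; they are not the formulas
used by the WikipediaSimilarity attached to Bug 32674.

```java
public class FlattenSketch {

    // Formulas as used by Lucene's DefaultSimilarity, restated in plain Java.
    static float defaultLengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    static float defaultIdf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Hypothetical flatter variant: fourth root instead of square root,
    // so short fields are favored less strongly over long ones.
    static float flatLengthNorm(int numTerms) {
        return (float) (1.0 / Math.pow(numTerms, 0.25));
    }

    // Flattening "all the way to 1.0": field length no longer matters.
    static float constantLengthNorm(int numTerms) {
        return 1.0f;
    }

    // Hypothetical flatter idf: the square root of the default, so that
    // idf squared in the final score grows like the default idf itself.
    static float flatIdf(int docFreq, int numDocs) {
        return (float) Math.sqrt(defaultIdf(docFreq, numDocs));
    }

    public static void main(String[] args) {
        int[] fieldLengths = {4, 400, 40000};
        for (int n : fieldLengths) {
            System.out.println(n + " terms: default=" + defaultLengthNorm(n)
                    + " flat=" + flatLengthNorm(n)
                    + " constant=" + constantLengthNorm(n));
        }
        // A rare term: appears in 9 of 1000 documents.
        System.out.println("idf default=" + defaultIdf(9, 1000)
                + " flat=" + flatIdf(9, 1000));
    }
}
```

With the default formula a 4-term field gets ten times the length norm
of a 400-term field; with the fourth-root variant the gap is a little
over three times.  In an actual Similarity subclass the same bodies
would go into the overridden lengthNorm() and idf() methods.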

You also might find it interesting to read the thread entitled "RE:
Scoring benchmark evaluation.  Was RE: How to proceed with Bug 31841 -
MultiSearcher problems with Similarity.docFreq() ?" on lucene-dev, as
this contains a discussion of many of the issues.
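
To make the effect discussed in this thread concrete, here is a
minimal sketch of how length normalization can outweigh coord().  It
is a simplified model in plain Java, not Lucene code: it assumes every
matched term occurs once and all terms share the same idf, and it
ignores queryNorm, boosts and the one-byte encoding Lucene uses to
store norms.  The class and method names are hypothetical.

```java
public class CoordVsLengthNorm {

    // coord() as in DefaultSimilarity: fraction of query terms matched.
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    /**
     * Simplified relative score: coord * (number of matched terms) * lengthNorm.
     * tf and idf are taken as equal for every term, so they drop out of
     * the comparison between two documents.
     */
    static float score(int matched, int queryTerms, int fieldLength,
                       boolean normalizeLength) {
        float lengthNorm = normalizeLength
                ? (float) (1.0 / Math.sqrt(fieldLength)) // DefaultSimilarity formula
                : 1.0f;                                  // lengthNorm fixed at 1.0
        return coord(matched, queryTerms) * matched * lengthNorm;
    }

    public static void main(String[] args) {
        int queryTerms = 3;
        // Short document: 4 terms, matches 1 of the 3 query terms.
        // Long document: 400 terms, matches all 3 query terms.
        System.out.println("default lengthNorm: short="
                + score(1, queryTerms, 4, true)
                + " long=" + score(3, queryTerms, 400, true));
        System.out.println("lengthNorm = 1.0:   short="
                + score(1, queryTerms, 4, false)
                + " long=" + score(3, queryTerms, 400, false));
    }
}
```

Under these assumptions the short document with a single match edges
out the long document that matches every term when the default
lengthNorm is used, and the order reverses once lengthNorm is fixed at
1.0.  In Lucene itself the length norm is computed when a document is
indexed, so a change to lengthNorm() only takes effect after
reindexing.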

Good luck,

Chuck

  > -----Original Message-----
  > From: Erik Hatcher [mailto:[EMAIL PROTECTED]
  > Sent: Monday, February 07, 2005 6:51 AM
  > To: Lucene Users List
  > Subject: Re: Similarity coord,lengthNorm
  > 
  > 
  > On Feb 7, 2005, at 8:53 AM, Michael Celona wrote:
  > > Would fixing the lengthNorm to 1 fix this problem?
  > 
  > Yes, it would eliminate the length of a field as a factor.
  > 
  > Your best bet is to set up a test harness where you can try out
  > various tweaks to Similarity, but setting the length normalization
  > factor to 1.0 may be all you need to do, as the coord() takes care
  > of the other factor you're after.
  > 
  >     Erik
  > 
  > >
  > > Michael
  > >
  > > -----Original Message-----
  > > From: Michael Celona [mailto:[EMAIL PROTECTED]
  > > Sent: Monday, February 07, 2005 8:48 AM
  > > To: Lucene Users List
  > > Subject: Similarity coord,lengthNorm
  > >
  > > I have varying length text fields which I am searching on.  I
  > > would like relevancy to be dictated predominantly by the number of
  > > terms in my query that match.  Right now I am seeing a high
  > > relevancy for a single word matching in a small document even
  > > though not all the terms in my query match.  Does anyone have an
  > > example of a custom Similarity subclass which overrides the coord
  > > and lengthNorm methods?
  > >
  > > Thanks..
  > >
  > > Michael
  > >
  > > ---------------------------------------------------------------------
  > > To unsubscribe, e-mail: [EMAIL PROTECTED]
  > > For additional commands, e-mail: [EMAIL PROTECTED]


