Hi Michael, I'd suggest first using the explain() mechanism to figure out what's going on. Besides lengthNorm(), another factor that is likely skewing your results in my experience is idf(), which Lucene typically makes very large by squaring the intrinsic value. I've found it helpful to flatten lengthNorm(), tf() and idf() relative to what is used in DefaultSimilarity. There is a comparative evaluation of Similarity's going on now. You might consider looking at these:
Bug 32674 has a WikipediaSimilarity posted that you might want to try. You might want to flatten lengthNorm() even further (e.g. all the way to 1.0), but I'd suggest trying it as is first. If you try it, please post your assessment. Here's the link: http://issues.apache.org/bugzilla/show_bug.cgi?id=32674 You also might find it interesting to read the thread entitled "RE: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?" on lucene-dev, as this contains a discussion of many of the issues. Good luck, Chuck > -----Original Message----- > From: Erik Hatcher [mailto:[EMAIL PROTECTED] > Sent: Monday, February 07, 2005 6:51 AM > To: Lucene Users List > Subject: Re: Similarity coord,lengthNorm > > > On Feb 7, 2005, at 8:53 AM, Michael Celona wrote: > > Would fixing the lengthNorm to 1 fix this problem? > > Yes, it would eliminate the length of a field as a factor. > > Your best bet is to set up a test harness where you can try out various > tweaks to Similarity, but setting the length normalization factor to > 1.0 may be all you need to do, as the coord() takes care of the other > factor you're after. > > Erik > > > > > Michael > > > > -----Original Message----- > > From: Michael Celona [mailto:[EMAIL PROTECTED] > > Sent: Monday, February 07, 2005 8:48 AM > > To: Lucene Users List > > Subject: Similarity coord,lengthNorm > > > > I have varying length text fields which I am searching on. I would > > like > > relevancy to be dictated predominantly by the number of terms in my > > query > > that match. Right now I am seeing a high relevancy for a single word > > matching in a small document even though all the terms in my query > > don't > > match. Does, anyone have an example of a custom Similarity sub class > > which > > overrides the coord and lengthNorm methods. > > > > > > > > Thanks.. > > > > Michael > > > > > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]