Doug,
I've implemented a subclass of DefaultSimilarity (called WESimilarity.java,
copy attached) which defines a new lengthNorm() method more or less as you
suggested. I then added a line prior to using my IndexWriter:
writer.setSimilarity(new WESimilarity()), and a similar line prior to using
my IndexSeacher: searcher.setSimilarity(new WESimilarity()).
The result:
1) There's no change whatsoever in the computed scores, and
2) The debugging messages never get printed out.
I know the WESimilarity is being used (because if I rename it I get an
exception), but it does not appear that the new lengthNorm() method is being
called.
It's probably some silly goof, but I can't figure out where it is.
If you (or anyone else, of course) have any ideas/suggestions, I'd
appreciate them.
Regards,
Terry
----- Original Message -----
From: "Terry Steichen" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, February 10, 2003 2:28 PM
Subject: Re: Computing Relevancy Differently
> Doug,
>
> That's excellent. Just what I've been looking for. I'll start
> experimenting shortly.
>
> Regards,
>
> Terry
>
> ----- Original Message -----
> From: "Doug Cutting" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Monday, February 10, 2003 1:57 PM
> Subject: Re: Computing Relevancy Differently
>
>
> > Terry Steichen wrote:
> > > Can you give me an idea of what to replace the lengthNorm() method
with
> to,
> > > for example, remove any special weight given to shorter matching
> documents?
> >
> > The goal of the default implementation is not to give any special weight
> > to shorter documents, but rather to remove the advantage longer
> > documents have. Longer documents are likely to have more matches simply
> > because they contain more terms. Also, for the query "foo", a document
> > containing just "foo" is a better match than a longer one containing
> > "foo bar baz", since the match is more exact.
> >
> > However, one problem with this approach can be that very short documents
> > are in fact not very informative. Thus a bias against very short
> > documents is sometimes useful.
> >
> > > I can certainly go through a bunch of trial-and-error efforts, but it
> would
> > > help if I had some grasp of the logic initially.
> > >
> > > For example, from DefaultSimilarity, here's the lengthNorm() method:
> > >
> > > public float lengthNorm(String fieldName, int numTerms) {
> > > return (float)(1.0 / Math.sqrt(numTerms));
> > > }
> > >
> > > Should I (for the purpose of eliminating any size bias) override it to
> > > always return a 1?
> >
> > That's something to try, although, as mentioned above, I suspect your
> > top hits will be dominated by long documents. Try it. It's really not
> > a difficult experiment!
> >
> > One trick I've used to keep very short documents from dominating
> > results, that, while good matches, are not informative documents, is to
> > override this with something like:
> >
> > public float lengthNorm(String fieldName, int numTerms) {
> > super.lengthNorm(fieldName, Math.max(numTerms, 100));
> > }
> >
> > This way all fields shorter than 100 terms are scored like fields
> > containing 100 terms. Long documents are still normalized, but search
> > is biased a bit against very short documents.
> >
> > > How would I boost the headline field here? Is that how you are
supposed
> to
> > > use the (presently unused) fieldName parameter? If that's the case, I
> > > assume I would logically (to do what I'm trying to do) make this
factor
> > > greater than 1 for the 'headline' field, and 1 for all other fields?
> >
> > You could do that here too. So, for example, you could do something
like:
> >
> > public float lengthNorm(String fieldName, int numTerms) {
> > float n = super.lengthNorm(fieldName, Math.max(numTerms, 100));
> > if (fieldName.equals("headline"))
> > n *= 4.0f;
> > return n;
> > }
> >
> > Equivalently, you could create your documents with something like:
> >
> > Document d = new Document();
> > Field f = new Field.Text("headline", headline);
> > f.setBoost(4.0f);
> > ...
> >
> > But headlines tend to be short, and naturally benefit from the default
> > lengthNorm implementation. So what you really might want is something
> like:
> >
> > public float lengthNorm(String fieldName, int numTerms) {
> > if (fieldName.equals("headline"))
> > return 4.0f * super.lengthNorm(fieldName, numTerms);
> > else
> > return super.lengthNorm(fieldName, Math.max(numTerms, 100));
> > }
> >
> > This is probably what I'd try first.
> >
> > Doug
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]