Doug,
That's excellent. Just what I've been looking for. I'll start
experimenting shortly.
Regards,
Terry
----- Original Message -----
From: "Doug Cutting" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, February 10, 2003 1:57 PM
Subject: Re: Computing Relevancy Differently
> Terry Steichen wrote:
> > Can you give me an idea of what to replace the lengthNorm() method with
to,
> > for example, remove any special weight given to shorter matching
documents?
>
> The goal of the default implementation is not to give any special weight
> to shorter documents, but rather to remove the advantage longer
> documents have. Longer documents are likely to have more matches simply
> because they contain more terms. Also, for the query "foo", a document
> containing just "foo" is a better match than a longer one containing
> "foo bar baz", since the match is more exact.
>
> However, one problem with this approach can be that very short documents
> are in fact not very informative. Thus a bias against very short
> documents is sometimes useful.
>
> > I can certainly go through a bunch of trial-and-error efforts, but it
would
> > help if I had some grasp of the logic initially.
> >
> > For example, from DefaultSimilarity, here's the lengthNorm() method:
> >
> > public float lengthNorm(String fieldName, int numTerms) {
> > return (float)(1.0 / Math.sqrt(numTerms));
> > }
> >
> > Should I (for the purpose of eliminating any size bias) override it to
> > always return a 1?
>
> That's something to try, although, as mentioned above, I suspect your
> top hits will be dominated by long documents. Try it. It's really not
> a difficult experiment!
>
> One trick I've used to keep very short documents from dominating
> results, that, while good matches, are not informative documents, is to
> override this with something like:
>
> public float lengthNorm(String fieldName, int numTerms) {
> super.lengthNorm(fieldName, Math.max(numTerms, 100));
> }
>
> This way all fields shorter than 100 terms are scored like fields
> containing 100 terms. Long documents are still normalized, but search
> is biased a bit against very short documents.
>
> > How would I boost the headline field here? Is that how you are supposed
to
> > use the (presently unused) fieldName parameter? If that's the case, I
> > assume I would logically (to do what I'm trying to do) make this factor
> > greater than 1 for the 'headline' field, and 1 for all other fields?
>
> You could do that here too. So, for example, you could do something like:
>
> public float lengthNorm(String fieldName, int numTerms) {
> float n = super.lengthNorm(fieldName, Math.max(numTerms, 100));
> if (fieldName.equals("headline"))
> n *= 4.0f;
> return n;
> }
>
> Equivalently, you could create your documents with something like:
>
> Document d = new Document();
> Field f = new Field.Text("headline", headline);
> f.setBoost(4.0f);
> ...
>
> But headlines tend to be short, and naturally benefit from the default
> lengthNorm implementation. So what you really might want is something
like:
>
> public float lengthNorm(String fieldName, int numTerms) {
> if (fieldName.equals("headline"))
> return 4.0f * super.lengthNorm(fieldName, numTerms);
> else
> return super.lengthNorm(fieldName, Math.max(numTerms, 100));
> }
>
> This is probably what I'd try first.
>
> Doug
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]