Doug,

I just extracted a portion of the database, reindexed and found the scores
to come out much more like we'd expect.  Appears this may be an indexing
issue - I index new stuff each day and merge the new index with the master
index.  Only redo the master when I can't avoid it (because it takes so
long).  I probably merge 100 times or more before reindexing.  This evening
I'll reindex and let you know if the apparent problem clears up.  If so,
I'll keep track of it as I continue to merge and see if there's any issue
there.

Thanks for the input (and from Erik, pointing me to the Explanation - it's
pretty neat).

Question: The new scores for the test database portion mentioned above all
seem to come out in the range of .06 to .07.  I assume this is because they
never get normalized.  If this is the case, (a) would it hurt anything to
"normalize up" (so the scores range up to 1), and if so (b) is there an
easy, non-disruptive (to the source code) way to do this?

Regards,

Terry


----- Original Message -----
From: "Doug Cutting" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, September 17, 2003 11:15 PM
Subject: Re: Lucene Scoring Behavior


> Hmm.  This makes no sense to me.  Can you supply a reproducible
> standalone test case?
>
> Doug
>
> Terry Steichen wrote:
> > Doug,
> >
> > (1) No, I did *not* boost the pub_date field, either in the indexing
process
> > or in the query itself.
> >
> > (2) And, each pub_date field of each document (which is in XML format)
> > contains only one instance of the date string.
> >
> > (3) And only the pub_date field itself is indexed.  There are other
> > attributes of this field that may contain the date string, but they
aren't
> > indexed - that is, they are not included in the instantiated Document
class.
> >
> > Regards,
> >
> > Terry
> >
> > ----- Original Message -----
> > From: "Doug Cutting" <[EMAIL PROTECTED]>
> > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > Sent: Wednesday, September 17, 2003 5:51 PM
> > Subject: Re: Lucene Scoring Behavior
> >
> >
> >
> >>Terry Steichen wrote:
> >>
> >>>  0.03125 = fieldNorm(field=pub_date, doc=90992)
> >>>  1.0 = fieldNorm(field=pub_date, doc=90970)
> >>
> >>It looks like the fieldNorm's are what differ, not the IDFs.  These are
> >>the product of the document and/or field boost, and 1/sqrt(numTerms)
> >>where numTerms is the number of terms in the "pub_date" field of the
> >>document.  Thus if each document is only assigned one date, and you
> >>didn't boost the field or the document when you indexed it, this should
> >>be 1.0.  But if the document has two dates, then this would be
> >>1/sqrt(2).  Or if you boosted this document pub_date field, then this
> >>will have whatever boost you provided.
> >>
> >>So, did you boost anything when indexing?  Or could a single document
> >>have two or more different values for pub_date?  Either would explain
> >
> > this.
> >
> >>Doug
> >>
> >>
> >>---------------------------------------------------------------------
> >>To unsubscribe, e-mail: [EMAIL PROTECTED]
> >>For additional commands, e-mail: [EMAIL PROTECTED]
> >>
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to