Re: Lucene Scoring Behavior

Terry Steichen Wed, 17 Sep 2003 14:33:53 -0700

Doug/Erik,

I do use RangeQuery to get a range of dates, but in this case I'm just
getting a single date (string), so I believe it's just a regular query I'm
using.


Per Erik's suggestion, I checked out the Explanation for some of these
anomalies.  I've included a condensation of the data it generated below
(which, frankly, I don't understand).  Perhaps that will give you or
Erik some insight into what's happening?

Regards,

Terry

PS: I note that the 'docFreq' parameters displayed below correspond exactly
to the number of hits for the query.  Also, here's the Similarity class I'm
using (per an earlier suggestion of Doug):

public class WESimilarity2 extends org.apache.lucene.search.DefaultSimilarity {

  // Treat every field as at least 750 terms long, and boost the
  // headline and summary fields by a factor of 4.
  public float lengthNorm(String fieldName, int numTerms) {
    if (fieldName.equals("headline") || fieldName.equals("summary")
        || fieldName.equals("ssummary")) {
      return 4.0f * super.lengthNorm(fieldName, Math.max(numTerms, 750));
    } else {
      return super.lengthNorm(fieldName, Math.max(numTerms, 750));
    }
  }
}
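
To make the effect of this class concrete, here is a minimal standalone
sketch of the same arithmetic.  It assumes DefaultSimilarity's lengthNorm
is 1/sqrt(numTerms).  The names LengthNormSketch, defaultLengthNorm and
weLengthNorm are made up for illustration, and the code does not depend on
Lucene.

```java
// Standalone sketch; does not use Lucene. All names are illustrative only.
class LengthNormSketch {

    // Assumption: DefaultSimilarity.lengthNorm is 1 / sqrt(numTerms).
    static float defaultLengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Mirrors WESimilarity2: clamp the length to at least 750 terms,
    // then multiply by 4 for the headline and summary fields.
    static float weLengthNorm(String fieldName, int numTerms) {
        float base = defaultLengthNorm(Math.max(numTerms, 750));
        boolean boosted = fieldName.equals("headline")
                || fieldName.equals("summary")
                || fieldName.equals("ssummary");
        return boosted ? 4.0f * base : base;
    }

    public static void main(String[] args) {
        // A one-term date field is treated like a 750-term field.
        System.out.println(weLengthNorm("pub_date", 1));
        // A short headline gets four times that value.
        System.out.println(weLengthNorm("headline", 12));
        // Fields longer than 750 terms are not affected by the clamp.
        System.out.println(weLengthNorm("body", 3000));
    }
}
```

Under that assumption a one-term pub_date field gets the same norm as a
750-term field, roughly 0.0365.  The fieldNorm of 0.03125 in the
explanations below is close to this; the gap may come from how the norm is
stored in the index, which is a guess rather than something confirmed in
this thread.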




Query #1: pub_date:20030917
All items: Score: .23000652
0.23000652 = weight(pub_date:20030917 in 91197), product of:
  0.99999994 = queryWeight(pub_date:20030917), product of:
    7.360209 = idf(docFreq=157)
    0.1358657 = queryNorm
  0.23000653 = fieldWeight(pub_date:20030917 in 91197), product of:
    1.0 = tf(termFreq(pub_date:20030917)=1)
    7.360209 = idf(docFreq=157)
    0.03125 = fieldNorm(field=pub_date, doc=91197)
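
The products in the Query #1 explanation can be re-checked by hand.  The
sketch below only multiplies the numbers printed above; the class and
method names are hypothetical and nothing here calls Lucene.

```java
// Re-computes the products shown in the Query #1 explanation.
// Illustrative names; plain float arithmetic only.
class ExplainArithmetic {

    // fieldWeight = tf * idf * fieldNorm
    static float fieldWeight(float tf, float idf, float fieldNorm) {
        return tf * idf * fieldNorm;
    }

    // queryWeight = idf * queryNorm
    static float queryWeight(float idf, float queryNorm) {
        return idf * queryNorm;
    }

    // weight = queryWeight * fieldWeight
    static float weight(float queryWeight, float fieldWeight) {
        return queryWeight * fieldWeight;
    }

    public static void main(String[] args) {
        float idf = 7.360209f;        // idf(docFreq=157)
        float queryNorm = 0.1358657f; // queryNorm
        float fieldNorm = 0.03125f;   // fieldNorm(field=pub_date, doc=91197)

        float fw = fieldWeight(1.0f, idf, fieldNorm);
        float qw = queryWeight(idf, queryNorm);
        System.out.println("fieldWeight = " + fw);
        System.out.println("queryWeight = " + qw);
        System.out.println("weight      = " + weight(qw, fw));
    }
}
```

Since queryNorm here is very nearly 1/idf, queryWeight comes out at about
1.0, so for these single-term queries the score is effectively
idf * fieldNorm.  That would explain why some of the explanations show
only the fieldWeight part.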

Query #2: pub_date:20030916
All items: Score: .22295427
0.22295427 = fieldWeight(pub_date:20030916 in 90992), product of:
  1.0 = tf(termFreq(pub_date:20030916)=1)
  7.1345367 = idf(docFreq=197)
  0.03125 = fieldNorm(field=pub_date, doc=90992)
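
The idf values in Query #1 and Query #2 differ only because the docFreq
differs.  A minimal sketch, assuming DefaultSimilarity computes idf as
ln(numDocs / (docFreq + 1)) + 1: the helper impliedNumDocs inverts that
formula to estimate the index size from a printed idf.  Both helper names
are hypothetical, and the index size is not stated anywhere in this thread.

```java
// Sketch of the assumed idf formula and its inverse. Does not use Lucene.
class IdfSketch {

    // Assumption: idf = ln(numDocs / (docFreq + 1)) + 1
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Inverse of the formula above: estimate numDocs from docFreq and idf.
    static double impliedNumDocs(int docFreq, double idf) {
        return (docFreq + 1) * Math.exp(idf - 1.0);
    }

    public static void main(String[] args) {
        // Values taken from the explanations for Query #1 and Query #2.
        System.out.println(impliedNumDocs(157, 7.360209));
        System.out.println(impliedNumDocs(197, 7.1345367));
    }
}
```

If the two estimates agree, the formula fits the data, and the rarer date
(157 documents) ends up with the higher idf, as Doug describes in the
quoted message.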


Query #3: pub_date:20030915
Items 1&2: Score: 1.0
7.2580175 = weight(pub_date:20030915 in 90970), product of:
  0.99999994 = queryWeight(pub_date:20030915), product of:
    7.258018 = idf(docFreq=174)
    0.13777865 = queryNorm
  7.258018 = fieldWeight(pub_date:20030915 in 90970), product of:
    1.0 = tf(termFreq(pub_date:20030915)=1)
    7.258018 = idf(docFreq=174)
    1.0 = fieldNorm(field=pub_date, doc=90970)

Query #3 (same as above): pub_date:20030915
Other items: Score: .03125
0.22681305 = weight(pub_date:20030915 in 90826), product of:
  0.99999994 = queryWeight(pub_date:20030915), product of:
    7.258018 = idf(docFreq=174)
    0.13777865 = queryNorm
  0.22681306 = fieldWeight(pub_date:20030915 in 90826), product of:
    1.0 = tf(termFreq(pub_date:20030915)=1)
    7.258018 = idf(docFreq=174)
    0.03125 = fieldNorm(field=pub_date, doc=90826)
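
The reported scores of 1.0 and .03125 for Query #3 are consistent with the
raw weights above being divided by the top raw score.  The sketch below
assumes the rule guessed at in the original question: scores are scaled
only when the top raw score exceeds 1.  The class and method names are
hypothetical, and this is not Lucene's own code.

```java
// Sketch of the assumed top-score normalization. Does not use Lucene.
class HitsNormalizationSketch {

    // rawScores must be sorted in descending order.
    // Assumption: if the top raw score is greater than 1, every score is
    // divided by it; otherwise the scores are returned unchanged.
    static float[] normalize(float[] rawScores) {
        float norm = 1.0f;
        if (rawScores.length > 0 && rawScores[0] > 1.0f) {
            norm = 1.0f / rawScores[0];
        }
        float[] out = new float[rawScores.length];
        for (int i = 0; i < rawScores.length; i++) {
            out[i] = rawScores[i] * norm;
        }
        return out;
    }

    public static void main(String[] args) {
        // Query #3: two documents with fieldNorm 1.0, the rest with 0.03125.
        float[] q3 = normalize(new float[] {7.2580175f, 7.2580175f, 0.22681305f});
        System.out.println(q3[0] + " " + q3[1] + " " + q3[2]);

        // Query #1: top raw score is below 1, so nothing is rescaled.
        float[] q1 = normalize(new float[] {0.23000652f, 0.23000652f});
        System.out.println(q1[0] + " " + q1[1]);
    }
}
```

This accounts for the displayed scores but not for why two documents carry
a fieldNorm of 1.0 while the others carry 0.03125.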

Query #4: pub_date:20030914
0.21384604 = weight(pub_date:20030914 in 90417), product of:
  0.99999994 = queryWeight(pub_date:20030914), product of:
    6.843074 = idf(docFreq=264)
    0.14613315 = queryNorm
  0.21384606 = fieldWeight(pub_date:20030914 in 90417), product of:
    1.0 = tf(termFreq(pub_date:20030914)=1)
    6.843074 = idf(docFreq=264)
    0.03125 = fieldNorm(field=pub_date, doc=90417)

Query #5: pub_date:20030913
Items 1&2: Score: 1.0
7.366558 = fieldWeight(pub_date:20030913 in 90591), product of:
  1.0 = tf(termFreq(pub_date:20030913)=1)
  7.366558 = idf(docFreq=156)
  1.0 = fieldNorm(field=pub_date, doc=90591)

Query #5 (same as above): pub_date:20030913
Other items: Score: .03125
0.23020494 = fieldWeight(pub_date:20030913 in 90383), product of:
  1.0 = tf(termFreq(pub_date:20030913)=1)
  7.366558 = idf(docFreq=156)
  0.03125 = fieldNorm(field=pub_date, doc=90383)


----- Original Message -----
From: "Doug Cutting" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, September 17, 2003 4:55 PM
Subject: Re: Lucene Scoring Behavior


> If you're using RangeQuery to do date searching, then you'll likely see
> unusual scoring.  The IDF of a date, like any other term, is inversely
> related to the number of documents with that date.  So documents whose
> dates are rare will score higher, which is probably not what you intend.
>
> Using a Filter for date searching is one way to remove dates from the
> scoring calculation.  Another is to provide a Similarity implementation
> that gives an IDF of 1.0 for terms from your date field, e.g., something
> like:
>
> public class MySimilarity extends DefaultSimilarity {
>    public float idf(Term term, Searcher searcher) throws IOException {
>      if ("date".equals(term.field())) {
>        return 1.0f;
>      } else {
>        return super.idf(term, searcher);
>      }
>    }
> }
>
> Or you could just give date clauses of your query a very small boost
> (e.g., .0001) so that other clauses dominate the scoring.
>
> Doug
>
> Terry Steichen wrote:
> > I've run across some puzzling behavior regarding scoring.  I have a set
> > of documents which contain, among others, a date field (whose content is
> > a string in the YYYYMMDD format).  When I query on the date 20030917
> > (that is, today), I get 157 hits, all of which have a score of
> > .23000652.  If I use 20030916 (yesterday), I get 197 hits, each of which
> > has a score of .22295427.
> >
> > So far, all seems logical.  However, when I search for all records for
> > the date 20030915, the first two (of 174 hits) have a score of 1.0,
> > while all the rest of the hits have a score of .03125.  Here is a
> > tabulation of these and a few more queries:
> >
> > Query Date      Result
> > =======        ========================
> > 20030917        all have a score of .23000652 (157)
> > 20030916        all have a score of .22295427 (197)
> > 20030915        first 2 have a 1.0 score, all rest are .03125 (174)
> > 20030914        all have a score of .21384604 (264)
> > 20030913        first 2 have a 1.0 score, all rest are .03125 (156)
> > 20030912        all have a score .2166833 (241)
> > 20030911        first 3 have a 1.0 score, all rest are .03125 (244)
> > 20030910        all have a score of  .2208193 (211)
> >
> > I would expect that all the hits would have the same score, and I would
> > expect it to be normalized to 1 (unless, I guess, the top score was less
> > than 1, in which case normalization presumably doesn't occur).
> >
> > Does anyone have any ideas as to what might be going on here?  (I'm
> > using the latest CVS sources, obtained this afternoon.)
> >
> > Regards,
> >
> > Terry
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>

