Otis,

Didn't somebody (Doug?) also mention that a keyword in a shorter document is
deemed more significant than in a longer one (because, I guess, it
represents a larger percentage of the document)?
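
If so, that would be consistent with the norm_d_t term in the FAQ formula quoted below, which divides each term's contribution by the square root of the number of tokens in the field. A rough sketch in Python (not Lucene code; length_norm is a made-up name, and only tf_d / norm_d_t is computed, with the rest of the formula left out):

```python
import math


def length_norm(num_tokens):
    """Return 1 / norm_d_t, where norm_d_t = sqrt(number of tokens in the field)."""
    return 1.0 / math.sqrt(num_tokens)


# The keyword occurs exactly once in both documents, so tf_d = sqrt(1) in each.
tf_d = math.sqrt(1)

short_doc = tf_d * length_norm(25)    # field with 25 tokens
long_doc = tf_d * length_norm(400)    # field with 400 tokens

print(short_doc, long_doc)
```

With these numbers the factor is 0.2 for the short document and 0.05 for the long one, so the same single keyword hit counts four times as much in the shorter document.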

Regards,

Terry
----- Original Message -----
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>;
<[EMAIL PROTECTED]>
Sent: Thursday, January 23, 2003 10:58 AM
Subject: Re: Interpreting the score associated with the Term?


> Here is a simplified explanation of some basic stuff.
>
> 1. the more frequent the term (in a collection) the lower its weight
> (significance).  Makes sense - very popular words don't distinguish one
> document from the other much, because they are present in so many docs.
>
> 2. the more frequent a word in a single document, the higher the
> document's 'value' when the query contains that word.  So the score goes
> up for frequent words in a document, esp. if they are not frequent in
> other documents in the collection.
>
> 3. there is a boost factor which allows you to boost certain terms at
> query time (e.g. you value matches in the title field more than in the
> body field?  boost title field queries)
>
> 4. the normalization factor, I believe, normalizes things so that longer
> documents don't have an advantage over shorter ones.
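
Points 1 and 2 can be sketched with the tf_d and idf_t definitions from the FAQ formula quoted further down. This is plain Python rather than Lucene code, the function names are made up, and idf_t is read as log(numDocs/(docFreq_t+1)) + 1.0, which is an assumption about where the parentheses belong:

```python
import math


def idf(num_docs, doc_freq):
    # Point 1: the more documents contain the term, the lower its weight.
    # Assumed reading of the FAQ line: log(numDocs / (docFreq_t + 1)) + 1.0
    return math.log(num_docs / (doc_freq + 1)) + 1.0


def tf(freq):
    # Point 2: more occurrences in one document raise the score,
    # but only with the square root of the count.
    return math.sqrt(freq)


rare_term = idf(1000, 9)       # term found in 9 of 1000 documents
common_term = idf(1000, 999)   # term found in 999 of 1000 documents

print(rare_term, common_term)
print(tf(1), tf(4))
```

Under that reading the rare term gets a weight of roughly 5.6 and the near-universal term gets 1.0, while quadrupling the occurrences in a document only doubles tf.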
>
> There is more to this....but I am already not 100% sure about all of
> the above, so I'll stop here :)
>
> Also note that you can boost fields at index time (I believe you'll
> need a nightly build rather than the 1.2 release for this).
>
> Otis
>
>
> --- Rishabh Bajpai <[EMAIL PROTECTED]> wrote:
> >
> > Hi All,
> >
> > I am using Lucene as a Search Engine for my work. I am new to this,
> > so forgive me if I am asking a cliched question!
> >
> > I need to understand how the SCORE for the search TERMs is calculated
> > in Lucene, so that indexing can be designed appropriately to return
> > the most relevant results when searched.
> >
> > On the official FAQ page of the Lucene site, a formula is listed as
> > score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t *
> > boost_t) * coord_q_d
> > where:
> >   score_d   : score for document d
> >   sum_t     : sum for all terms t
> >   tf_q      : the square root of the frequency of t in the query
> >   tf_d      : the square root of the frequency of t in d
> >   idf_t     : log(numDocs/(docFreq_t+1)) + 1.0
> >   numDocs   : number of documents in index
> >   docFreq_t : number of documents containing t
> >   norm_q    : sqrt(sum_t((tf_q*idf_t)^2))
> >   norm_d_t  : square root of number of tokens in d in the same field
> > as t
> >   boost_t   : the user-specified boost for term t
> >   coord_q_d : number of terms in both query and document / number of
> > terms in query
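
To make the formula easier to follow, here is a minimal sketch that evaluates it term by term for one document. It is plain Python, not Lucene code: the names score and idf, the single-field assumption, the default boost of 1.0 and the reading of idf_t as log(numDocs/(docFreq_t+1)) + 1.0 are all assumptions made for illustration.

```python
import math
from collections import Counter


def idf(num_docs, doc_freq):
    # Assumed reading of the FAQ line: log(numDocs / (docFreq_t + 1)) + 1.0
    return math.log(num_docs / (doc_freq + 1)) + 1.0


def score(query_terms, doc_tokens, doc_freqs, num_docs, boosts=None):
    """Score one single-field document against a list of query terms.

    doc_freqs maps a term to the number of documents containing it;
    boosts maps a term to boost_t (1.0 when absent).
    """
    boosts = boosts or {}
    q_counts = Counter(query_terms)
    d_counts = Counter(doc_tokens)
    if not q_counts or not d_counts:
        return 0.0

    idfs = {t: idf(num_docs, doc_freqs.get(t, 0)) for t in q_counts}
    # norm_q = sqrt(sum_t((tf_q * idf_t)^2))
    norm_q = math.sqrt(sum((math.sqrt(n) * idfs[t]) ** 2
                           for t, n in q_counts.items()))
    # norm_d_t = sqrt(number of tokens in the field)
    norm_d = math.sqrt(len(doc_tokens))

    total = 0.0
    matched = 0
    for t, n in q_counts.items():
        if d_counts[t] == 0:
            continue  # term not in the document: contributes nothing
        matched += 1
        query_part = math.sqrt(n) * idfs[t] / norm_q            # tf_q * idf_t / norm_q
        doc_part = math.sqrt(d_counts[t]) * idfs[t] / norm_d    # tf_d * idf_t / norm_d_t
        total += query_part * doc_part * boosts.get(t, 1.0)

    # coord_q_d = terms in both query and document / terms in query
    return total * matched / len(q_counts)


short_doc = ["lucene", "a", "b", "c"]     # 4 tokens
long_doc = ["lucene"] + ["x"] * 15        # 16 tokens
freqs = {"lucene": 1}                     # "lucene" is in 1 of 2 documents

print(score(["lucene"], short_doc, freqs, 2))
print(score(["lucene"], long_doc, freqs, 2))
```

In this toy setup the short document scores 0.5 and the long one 0.25, because the only difference between them is norm_d_t. Adding a query term that the document lacks lowers the score through both norm_q and coord_q_d.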
> >
> > I did not find the formula too helpful in figuring out what exactly
> > the score is trying to calculate.
> >
> > I am looking for a way to translate this score into something that
> > can be used to determine which Terms are more relevant for a given
> > Search Request.
> >
> > One way would be to just assume that the higher the score, the more
> > relevant the result. But is this assumption really valid? Or are
> > there any possible caveats to this?
> >
> > -Rishabh
> >
> > --
> > To unsubscribe, e-mail:
> > <mailto:[EMAIL PROTECTED]>
> > For additional commands, e-mail:
> > <mailto:[EMAIL PROTECTED]>
> >
>

