Is it possible to access each individual frequency in the score formula, so that I could
calculate and show the score for each sub-query?
Regards,
Hui
-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: Thu 1/23/2003 10:58 AM
To: Lucene Users List; [EMAIL PROTECTED]
Cc:
Subject: Re: Interpreting the score asociated with the Term?
Here is a simplified explanation of some basic stuff.
1. the more frequent the term (in a collection) the lower its weight
(significance). Makes sense - very popular words don't distinguish one
document from the other much, because they are present in so many docs.
2. the more frequent a word in a single document, the higher the
document's 'value' when the query contains that word. So the score goes
up for frequent words in a document, esp. if they are not frequent in
other documents in the collection.
3. there is a boost factor which allows you to boost certain terms at
query time (e.g. you value matches in title field more than the body
field? boost title field queries)
4. normalization factor, I believe, normalizes things so that longer
documents don't have an advantage over shorter ones.
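
To make points 1, 2 and 4 a bit more concrete, here is a rough sketch in
Python (not Lucene's actual code) of the three factors as the FAQ formula
quoted below defines them. The function names are made up for this
illustration, and the idf grouping log(numDocs/(docFreq+1)) is my reading of
the FAQ line.

```python
import math

def tf(freq):
    # Point 2: term frequency factor, the square root of the raw count.
    return math.sqrt(freq)

def idf(num_docs, doc_freq):
    # Point 1: the more documents contain the term, the smaller this gets.
    # Assumes the grouping log(numDocs / (docFreq + 1)) + 1.0.
    return math.log(num_docs / (doc_freq + 1)) + 1.0

def field_norm(num_tokens):
    # Point 4: the score is divided by this, so longer fields are damped.
    return math.sqrt(num_tokens)

# A term found in 9 of 1000 docs versus one found in 499 of 1000 docs.
rare = idf(1000, 9)
common = idf(1000, 499)
print("idf rare term:  ", round(rare, 3))
print("idf common term:", round(common, 3))

# The same term frequency counts for less in a longer field.
print("short field:", tf(4) / field_norm(16))
print("long field: ", tf(4) / field_norm(400))
```

The rare term gets the larger idf, and the same four occurrences are worth
less in the 400-token field than in the 16-token one.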
There is more to this... but I am already not 100% sure about all of the
above, so I'll stop here :)
Also note that you can boost fields at index time (you'll have to use
the nightly build for that instead of the 1.2 release to get this, I
believe).
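
As a rough illustration of how the FAQ formula quoted below breaks down
term by term, here is a small Python sketch that evaluates it for one
document and keeps each term's contribution separate. This is only a sketch
under stated assumptions, not Lucene's implementation: score_breakdown and
its parameters are hypothetical names, everything lives in a single field,
and document frequencies are passed in as a plain dict.

```python
import math
from collections import Counter

def idf(num_docs, doc_freq):
    # Assumes the grouping log(numDocs / (docFreq + 1)) + 1.0.
    return math.log(num_docs / (doc_freq + 1)) + 1.0

def score_breakdown(query_terms, doc_tokens, doc_freqs, num_docs, boosts=None):
    """Return (per-term contributions, coord, total score) for one document."""
    boosts = boosts or {}
    q_counts = Counter(query_terms)
    d_counts = Counter(doc_tokens)

    # norm_q = sqrt(sum_t((tf_q * idf_t)^2)) over the query terms
    norm_q = math.sqrt(sum(
        (math.sqrt(qf) * idf(num_docs, doc_freqs.get(t, 0))) ** 2
        for t, qf in q_counts.items()))
    # norm_d_t = square root of the number of tokens in the field
    norm_d = math.sqrt(len(doc_tokens))

    contributions = {}
    for t, qf in q_counts.items():
        if d_counts[t] == 0:
            continue  # a term missing from the document adds nothing
        idf_t = idf(num_docs, doc_freqs.get(t, 0))
        query_part = math.sqrt(qf) * idf_t / norm_q
        doc_part = math.sqrt(d_counts[t]) * idf_t / norm_d
        contributions[t] = query_part * doc_part * boosts.get(t, 1.0)

    # coord_q_d = terms in both query and document / terms in query
    coord = len(contributions) / len(q_counts)
    total = sum(contributions.values()) * coord
    return contributions, coord, total

doc = ["lucene", "score", "lucene", "index"]
freqs = {"lucene": 9, "score": 99}  # made-up document frequencies
parts, coord, total = score_breakdown(["lucene", "score"], doc, freqs, 1000)
for term, value in parts.items():
    print(term, round(value, 4))
print("coord:", coord, "total:", round(total, 4))
```

With these made-up numbers, "lucene" contributes more than "score" because it
is both rarer in the collection and more frequent in the document.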
Otis
--- Rishabh Bajpai <[EMAIL PROTECTED]> wrote:
>
> Hi All,
>
> I am using Lucene as a Search Engine for my work. I am new to this,
> so forgive me if I am asking a cliched question!
>
> I need to understand how the SCORE for the search TERMs is calculated
> in Lucene, so that indexing can be designed appropriately to
> return the most relevant results, when searched.
>
> On the official FAQ page of the Lucene site, a formula is listed as
> score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t *
> boost_t) * coord_q_d
> where:
> score_d : score for document d
> sum_t : sum for all terms t
> tf_q : the square root of the frequency of t in the query
> tf_d : the square root of the frequency of t in d
> idf_t : log(numDocs/(docFreq_t+1)) + 1.0
> numDocs : number of documents in index
> docFreq_t : number of documents containing t
> norm_q : sqrt(sum_t((tf_q*idf_t)^2))
> norm_d_t : square root of number of tokens in d in the same field
> as t
> boost_t : the user-specified boost for term t
> coord_q_d : number of terms in both query and document / number of
> terms in query
>
> I did not find the formula too helpful in figuring out what exactly
> the score is trying to calculate.
>
> I want to know of a logic that can be used for translating this score
> into something that can be used for determining which Terms are more
> relevant for a given Search Request.
>
> One way would be to just assume that the higher the score, the more
> relevant the search result. But is this assumption really valid? Or are
> there any possible caveats to this?
>
> -Rishabh
>
>
> --
> To unsubscribe, e-mail:
> <mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail:
> <mailto:[EMAIL PROTECTED]>
>