Re: Interpreting the score associated with the Term? |

Rishabh Bajpai Sun, 26 Jan 2003 01:51:23 -0800

 
Thanks nevertheless. It was really useful!
I think such basic information should also be put on the FAQ list, just for the sake 
of completeness, if nothing more. The interpretation of the formula is not really very 
straightforward. 
Anyway, that is just my take on this.
-Rishabh
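
For anyone else trying to read the FAQ formula quoted further down in this thread, here is a minimal Python sketch of it. This is not Lucene code: the function names `idf` and `score`, the token-list representation of documents, and the single-field simplification are all made up for illustration. It also reads idf_t as log(numDocs / (docFreq_t + 1)) + 1.0, which is an assumption about how the FAQ expression is grouped.

```python
import math
from collections import Counter


def idf(term, docs):
    """idf_t = log(numDocs / (docFreq_t + 1)) + 1.0 (grouping is an assumption)."""
    doc_freq = sum(1 for d in docs if term in d)
    return math.log(len(docs) / (doc_freq + 1)) + 1.0


def score(query, doc, docs, boosts=None):
    """Score one document against a query; both are lists of tokens.

    Hypothetical helper: treats every document as a single field.
    """
    if not query:
        return 0.0
    boosts = boosts or {}
    q_counts, d_counts = Counter(query), Counter(doc)

    # norm_q = sqrt(sum_t((tf_q * idf_t)^2)) over the query terms
    norm_q = math.sqrt(
        sum((math.sqrt(n) * idf(t, docs)) ** 2 for t, n in q_counts.items())
    )
    # norm_d_t = square root of the number of tokens in the document's field
    norm_d = math.sqrt(len(doc))

    total = 0.0
    matched = 0
    for t, n in q_counts.items():
        if d_counts[t] == 0:
            continue  # tf_d is 0, so this term contributes nothing
        matched += 1
        tf_q = math.sqrt(n)            # sqrt of frequency of t in the query
        tf_d = math.sqrt(d_counts[t])  # sqrt of frequency of t in the document
        i = idf(t, docs)
        total += tf_q * i / norm_q * tf_d * i / norm_d * boosts.get(t, 1.0)

    # coord_q_d = query terms found in the document / terms in the query
    coord = matched / len(q_counts)
    return total * coord


# Toy collection, invented for this example.
docs = [
    ["lucene", "scoring"],
    ["lucene", "scoring", "is", "based", "on", "term", "frequency", "and", "more"],
    ["the", "the", "cat"],
]

for d in docs:
    print(d, score(["lucene"], d, docs))
```

With this toy collection, the short document scores higher than the long one for the same single match (the length normalization), a term found in fewer documents gets a larger idf, and a document missing one of the query terms is scaled down by coord. As far as this sketch goes, the scores are only useful for ranking documents against one query, not as absolute measures of relevance.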


--

On Sat, 25 Jan 2003 20:41:18  
 Otis Gospodnetic wrote:
>Maybe we'll do this in the future, but what I provided is really basic,
>not Lucene specific info.  If somebody else describes it in detail I'll
>gladly put it up on the site.
>
>Otis
>
>--- Terry Steichen <[EMAIL PROTECTED]> wrote:
>> Otis,
>> 
>> I think the effort you made in your previous message (to describe the
>> basic relevance measures in simple, non-algorithmic terms) is very
>> important.  If you think that list is reasonably comprehensive (that is,
>> it captures most of what relevance means), I'd urge you to insert this
>> into the documentation.  I think it is very valuable.
>> 
>> Regards,
>> 
>> Terry
>> 
>> ----- Original Message -----
>> From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
>> To: "Lucene Users List" <[EMAIL PROTECTED]>
>> Sent: Thursday, January 23, 2003 12:02 PM
>> Subject: Re: Interpreting the score associated with the Term? |
>> 
>> 
>> > Yes, I believe so.
>> >
>> > --- Terry Steichen <[EMAIL PROTECTED]> wrote:
>> > > Otis,
>> > >
>> > > Didn't somebody (Doug?) also mention that a keyword in a shorter
>> > > document is deemed more significant than in a longer one (because,
>> > > I guess, it represents a larger percentage of the document)?
>> > >
>> > > Regards,
>> > >
>> > > Terry
>> > > ----- Original Message -----
>> > > From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
>> > > To: "Lucene Users List" <[EMAIL PROTECTED]>;
>> > > <[EMAIL PROTECTED]>
>> > > Sent: Thursday, January 23, 2003 10:58 AM
>> > > Subject: Re: Interpreting the score associated with the Term? |
>> > >
>> > >
>> > > > Here is a simplified explanation of some basic stuff.
>> > > >
>> > > > 1. The more frequent the term (in a collection), the lower its
>> > > > weight (significance).  Makes sense - very popular words don't
>> > > > distinguish one document from another much, because they are
>> > > > present in so many docs.
>> > > >
>> > > > 2. The more frequent a word in a single document, the higher the
>> > > > document's 'value' when the query contains that word.  So the
>> > > > score goes up for frequent words in a document, esp. if they are
>> > > > not frequent in other documents in the collection.
>> > > >
>> > > > 3. There is a boost factor which allows you to boost certain terms
>> > > > at query time (e.g. you value matches in the title field more than
>> > > > the body field?  Boost title field queries.)
>> > > >
>> > > > 4. The normalization factor, I believe, normalizes things so that
>> > > > longer documents don't have an advantage over shorter ones.
>> > > >
>> > > > There is more to this... but I am already not 100% sure about all
>> > > > of the above, so I'll stop here :)
>> > > >
>> > > > Also note that you can boost fields at index time (you'll have to
>> > > > use the nightly build for that instead of the 1.2 release to get
>> > > > this, I believe).
>> > > >
>> > > > Otis
>> > > >
>> > > >
>> > > > --- Rishabh Bajpai <[EMAIL PROTECTED]> wrote:
>> > > > >
>> > > > > Hi All,
>> > > > >
>> > > > > I am using Lucene as a search engine for my work. I am new to
>> > > > > this, so forgive me if I am asking a cliched question!
>> > > > >
>> > > > > I need to understand how the SCORE for the search TERMs is
>> > > > > calculated in Lucene, so that indexing can be designed
>> > > > > appropriately to return the most relevant results when searched.
>> > > > >
>> > > > > On the official FAQ page of the Lucene site, a formula is listed as
>> > > > >
>> > > > >   score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t
>> > > > >             * boost_t) * coord_q_d
>> > > > >
>> > > > > where:
>> > > > >   score_d   : score for document d
>> > > > >   sum_t     : sum for all terms t
>> > > > >   tf_q      : the square root of the frequency of t in the query
>> > > > >   tf_d      : the square root of the frequency of t in d
>> > > > >   idf_t     : log(numDocs/(docFreq_t+1)) + 1.0
>> > > > >   numDocs   : number of documents in index
>> > > > >   docFreq_t : number of documents containing t
>> > > > >   norm_q    : sqrt(sum_t((tf_q*idf_t)^2))
>> > > > >   norm_d_t  : square root of number of tokens in d in the same
>> > > > >               field as t
>> > > > >   boost_t   : the user-specified boost for term t
>> > > > >   coord_q_d : number of terms in both query and document /
>> > > > >               number of terms in query
>> > > > >
>> > > > > I did not find the formula too helpful in figuring out what
>> > > > > exactly the score is trying to calculate.
>> > > > >
>> > > > > I want to know of a logic that can be used for translating this
>> > > > > score into something that can be used for determining which
>> > > > > Terms are more relevant for a given Search Request.
>> > > > >
>> > > > > One way would be to just assume that the higher the score, the
>> > > > > more relevant the search result is. But is this assumption really
>> > > > > valid? Or are there any possible caveats to this?
>> > > > >
>> > > > > -Rishabh
>> > > > >
>> > > > > --
>> > > > > To unsubscribe, e-mail:
>> > > > > <mailto:[EMAIL PROTECTED]>
>> > > > > For additional commands, e-mail:
>> > > > > <mailto:[EMAIL PROTECTED]>
>> > > > >


