Otis, Didn't somebody (Doug?) also mention that a keyword in a shorter document is deemed more significant than in a longer one (because, I guess, it represents a larger percentage of the document)?
Regards, Terry ----- Original Message ----- From: "Otis Gospodnetic" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]> Sent: Thursday, January 23, 2003 10:58 AM Subject: Re: Interpreting the score asociated with the Term? | > Here is a simplified explanation of some basic stuff. > > 1. the more frequent the term (in a collection) the lower its weight > (significance). Makes sense - very popular words don't distinguish one > document from the other much, because they are present in so many docs. > > 2. the more frequent a word in a single document, the higher the > documents 'value' when the query contains that word. So the score goes > up for frequent words in a document, esp. if they are not frequent in > other documents in the collection. > > 3. there is a boost factor which allow you to boost certain terms at > query time (e.g. you value matches in title field more than the body > field? boost title field queries) > > 4. normalization factor, I believe, normalizes things so that longer > documents don't have advantage over shorter ones. > > There is more to this....but I am already not 100% about all of the > above, so I'll stop here :) > > Also note that you can boost fields at index time (you'll have to use > the nightly build for that instead of the 1.2 release to get this, I > believe). > > Otis > > > --- Rishabh Bajpai <[EMAIL PROTECTED]> wrote: > > > > Hi All, > > > > I am using Lucene as a Search Engine for my work. I am new to this, > > so forgive me if I am asking a cliched question! > > > > I need to understand how the SCORE for the search TERMs is calculated > > for Lucene, so that indexing can be appropriately be designed to > > return the most relevant results, when searched. > > > > On the official FAQ page of the Lucene site, a formula is listed as > > score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * > > boost_t) * coord_q_d > > where: > > score_d : score for document d > > sum_t : sum for all terms t > > tf_q : the square root of the frequency of t in the query > > tf_d : the square root of the frequency of t in d > > idf_t : log(numDocs/docFreq_t+1) + 1.0 > > numDocs : number of documents in index > > docFreq_t : number of documents containing t > > norm_q : sqrt(sum_t((tf_q*idf_t)^2)) > > norm_d_t : square root of number of tokens in d in the same field > > as t > > boost_t : the user-specified boost for term t > > coord_q_d : number of terms in both query and document / number of > > terms in query > > > > I didnot find the formula too helpful in figuring out what exactly > > the score is trying to calculate. > > > > I want to know of a logic that can be used for translating this score > > into something that can be used for determining which Terms are more > > relevant for a given Search Request. > > > > One way would be to just assume that - higher the score, more > > relveant is the search. But is this assumption really valid? Or are > > there any possible caveats to this? > > > > -Rishabh > > > > > > > > _____________________________________________________________ > > Get 25MB, POP3, Spam Filtering with LYCOS MAIL PLUS for $19.95/year. > > http://login.mail.lycos.com/brandPage.shtml?pageId=plus&ref=lmtplus > > > > -- > > To unsubscribe, e-mail: > > <mailto:[EMAIL PROTECTED]> > > For additional commands, e-mail: > > <mailto:[EMAIL PROTECTED]> > > > > > __________________________________________________ > Do you Yahoo!? > Yahoo! Mail Plus - Powerful. Affordable. Sign up now. > http://mailplus.yahoo.com > > -- > To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]> > For additional commands, e-mail: <mailto:[EMAIL PROTECTED]> > > -- To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]> For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
