Thank you for your answer. The Lucene scoring function uses only norm_q, but for a given query norm_q is the same for all documents, so it does not actually affect the ranking. norm_d is different: each document has its own norm_d, and it does affect the score of document d for query q. If you drop it, the score is no longer correct, or at least it is no longer the vector space model. Could you explain this a little more?
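To make my point concrete, here is a small sketch with toy numbers I made up (the idf values, term frequencies and function names are all my own assumptions, none of this is Lucene code). It ranks two documents with and without each normalization factor:

```python
import math

# Toy data, invented for illustration only; not taken from Lucene.
idf = {"lucene": 2.0, "score": 1.0, "index": 1.5}
query_tf = {"lucene": 1, "score": 1}
docs_tf = {
    "short": {"lucene": 1, "score": 1},
    "long": {"lucene": 2, "score": 1, "index": 10},
}


def weights(tf):
    """tf * idf weight for every term in a tf map."""
    return {t: f * idf[t] for t, f in tf.items()}


def norm(w):
    """Euclidean length of a weight vector."""
    return math.sqrt(sum(x * x for x in w.values()))


def rank(use_norm_q, use_norm_d):
    """Return document names ordered by score, best first."""
    wq = weights(query_tf)
    scores = {}
    for name, tf in docs_tf.items():
        wd = weights(tf)
        s = sum(wq[t] * wd.get(t, 0.0) for t in wq)  # inner product
        if use_norm_q:
            s /= norm(wq)  # same divisor for every document
        if use_norm_d:
            s /= norm(wd)  # different divisor per document
        scores[name] = s
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    print(rank(use_norm_q=False, use_norm_d=False))
    print(rank(use_norm_q=True, use_norm_d=False))
    print(rank(use_norm_q=True, use_norm_d=True))
```

With these numbers, switching norm_q on or off leaves the order at long, short, while switching norm_d on reverses it to short, long. That is the effect I am asking about.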
I think it is expensive to compute under incremental indexing because when one document is added, the idf of each term changes. But dropping it is not a good choice. What is the role of norm_d_t?

Nhan.

--- Chuck Williams <[EMAIL PROTECTED]> wrote:
> Nhan,
>
> Re. your two differences:
>
> 1 is not a difference. Norm_d and Norm_q are both independent of t,
> so summing over t has no effect on them. I.e., Norm_d * Norm_q is
> constant wrt the summation, so it doesn't matter if the sum is over
> just the numerator or over the entire fraction, the result is the
> same.
>
> 2 is a difference. Lucene uses Norm_q instead of Norm_d because
> Norm_d is too expensive to compute, especially in the presence of
> incremental indexing. E.g., adding or deleting any document changes
> the idf's, so if Norm_d was used it would have to be recomputed for
> ALL documents. This is not feasible.
>
> Another point you did not mention is that the idf term is squared
> (in both of your formulas). Salton, the originator of the vector
> space model, dropped one idf factor from his formula as it improved
> results empirically. More recent theoretical justifications of
> tf*idf provide intuitive explanations of why idf should only be
> included linearly. tf is best thought of as the real vector entry,
> while idf is a weighting term on the components of the inner
> product. E.g., see the excellent paper by Robertson, "Understanding
> inverse document frequency: on theoretical arguments for IDF",
> available here: http://www.emeraldinsight.com/rpsv/cgi-bin/emft.pl
> if you sign up for an eval.
>
> It's easy to correct for idf^2 by using a custom Similarity that
> takes a final square root.
>
> Chuck
>
> > -----Original Message-----
> > From: Vikas Gupta [mailto:[EMAIL PROTECTED]
> > Sent: Tuesday, December 14, 2004 9:32 PM
> > To: Lucene Users List
> > Subject: Re: A question about scoring function in Lucene
> >
> > Lucene uses the vector space model.
> > To understand that:
> >
> > -Read section 2.1 of "Space optimizations for Total Ranking" paper
> > (Linked here http://lucene.sourceforge.net/publications.html)
> > -Read section 6 to 6.4 of
> > http://www.csee.umbc.edu/cadip/readings/IR.report.120600.book.pdf
> > -Read section 1 of
> > http://www.cs.utexas.edu/users/inderjit/courses/dm2004/lecture5.ps
> >
> > Vikas
> >
> > On Tue, 14 Dec 2004, Nhan Nguyen Dang wrote:
> >
> > > Hi all,
> > > Lucene score document based on the correlation between
> > > the query q and document t:
> > > (this is raw function, I don't pay attention to the
> > > boost_t, coord_q_d factor)
> > >
> > > score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t
> > > / norm_d_t)  (*)
> > >
> > > Could anybody explain it in detail ? Or are there any
> > > papers, documents about this function ? Because:
> > >
> > > I have also read the book: Modern Information
> > > Retrieval, author: Ricardo Baeza-Yates and Berthier
> > > Ribeiro-Neto, Addison Wesley (Hope you have read it
> > > too). In page 27, they also suggest a scoring funtion
> > > for vector model based on the correlation between
> > > query q and document d as follow (I use different
> > > symbol):
> > >
> > >                sum_t( weight_t_d * weight_t_q)
> > > score_d(d, q)= --------------------------------- (**)
> > >                       norm_d * norm_q
> > >
> > > where weight_t_d = tf_d * idf_t
> > >       weight_t_q = tf_q * idf_t
> > >       norm_d = sqrt( sum_t( (tf_d * idf_t)^2 ) )
> > >       norm_q = sqrt( sum_t( (tf_q * idf_t)^2 ) )
> > >
> > > (**):          sum_t( tf_q*idf_t * tf_d*idf_t)
> > > score_d(d, q)= --------------------------------- (***)
> > >                       norm_d * norm_q
> > >
> > > The two function, (*) and (***), have 2 differences:
> > > 1. in (***), the sum_t is just for the numerator but
> > > in the (*), the sum_t is for everything. So, with
> > > norm_q = sqrt(sum_t((tf_q*idf_t)^2)); sum_t is
> > > calculated twice. Is this right? please explain.
> > >
> > > 2. No factor that define norms of the document: norm_d
> > > in the function (*). Can you explain this. what is the
> > > role of factor norm_d_t ?
> > >
> > > One more question: could anybody give me documents,
> > > papers that explain this function in detail. so when I
> > > apply Lucene for my system, I can adapt the document,
> > > and the field so that I still receive the correct
> > > scoring information from Lucene .
> > >
> > > Best regard,
> > > Thanks every body,
> > >
> > > =====
> > > Đặng Nhân
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]

=====
Đặng Nhân