Nhan,

Re: your two differences:

1 is not a difference.  Norm_d and Norm_q are both independent of t, so summing 
over t has no effect on them.  I.e., Norm_d * Norm_q is constant wrt the 
summation, so it doesn't matter if the sum is over just the numerator or over 
the entire fraction, the result is the same.
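
To make point 1 concrete, here is a minimal numeric sketch. The term weights are made-up toy values and the function names are hypothetical, not anything in Lucene. It only shows that dividing by the norms inside or outside the sum gives the same score when the norms do not depend on t.

```python
import math

# Hypothetical toy weights: weight[t] = tf * idf for each term t.
w_q = {"lucene": 1.2, "score": 0.7, "norm": 2.0}
w_d = {"lucene": 2.4, "score": 1.4, "norm": 0.5}

# Both norms are computed once; neither depends on the summation index t.
norm_q = math.sqrt(sum(w * w for w in w_q.values()))
norm_d = math.sqrt(sum(w * w for w in w_d.values()))


def score_sum_outside(w_q, w_d, norm_q, norm_d):
    """Sum the numerator over t, then divide by the norms (shape of ***)."""
    return sum(w_q[t] * w_d[t] for t in w_q) / (norm_q * norm_d)


def score_sum_inside(w_q, w_d, norm_q, norm_d):
    """Divide every term by the norms, then sum over t (shape of *)."""
    return sum(w_q[t] / norm_q * w_d[t] / norm_d for t in w_q)


outside = score_sum_outside(w_q, w_d, norm_q, norm_d)
inside = score_sum_inside(w_q, w_d, norm_q, norm_d)
print(outside, inside)  # equal up to floating-point rounding
```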

2 is a difference.  Lucene uses Norm_q instead of Norm_d because Norm_d is too 
expensive to compute, especially in the presence of incremental indexing.  
E.g., adding or deleting any document changes the idf's, so if Norm_d were used 
it would have to be recomputed for ALL documents.  This is not feasible.
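
A small sketch of why this is expensive, under stated assumptions: the idf formula below is a common smoothed variant chosen for illustration, and the three-document corpus is invented. Adding one document shifts the idf of every term, so the cosine norm of every existing document changes too.

```python
import math


def idf(term, docs):
    """Assumed smoothed idf for illustration: log(N / (df + 1)) + 1."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / (df + 1)) + 1.0


def norm_d(doc, docs):
    """Cosine norm of one document: sqrt(sum_t (tf * idf)^2)."""
    return math.sqrt(sum((tf * idf(t, docs)) ** 2 for t, tf in doc.items()))


# Each document is a hypothetical {term: tf} map.
docs = [
    {"lucene": 2, "index": 1},
    {"score": 3, "index": 1},
]
before = [norm_d(d, docs) for d in docs]

# Incrementally add one document; N and some df values change.
docs_after = docs + [{"index": 4, "query": 1}]
after = [norm_d(d, docs_after) for d in docs]

# True for each pre-existing document whose norm moved.
changed = [not math.isclose(b, a) for b, a in zip(before, after)]
print(changed)
```

In this toy corpus both of the old documents get a new norm, even though neither was touched by the update.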

Another point you did not mention is that the idf term is squared (in both of 
your formulas).  Salton, the originator of the vector space model, dropped one 
idf factor from his formula as it improved results empirically.  More recent 
theoretical justifications of tf*idf provide intuitive explanations of why idf 
should only be included linearly.  tf is best thought of as the real vector 
entry, while idf is a weighting term on the components of the inner product.  
E.g., see the excellent paper by Robertson, "Understanding inverse document 
frequency: on theoretical arguments for IDF", available here:  
http://www.emeraldinsight.com/rpsv/cgi-bin/emft.pl if you sign up for an eval.

It's easy to correct for idf^2 by using a custom Similarity that takes a 
final square root.
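
The arithmetic behind that correction can be sketched as follows. This is plain Python, not Lucene code: the idf formula and all function names are assumptions for illustration, and it reads "final square root" as applying sqrt to the value the idf function returns. Norms are left out so only the idf effect is visible.

```python
import math


def idf(doc_freq, num_docs):
    """Assumed idf for illustration: log(N / (df + 1)) + 1."""
    return math.log(num_docs / (doc_freq + 1)) + 1.0


def sqrt_idf(doc_freq, num_docs):
    """Corrected variant: take a final square root of the idf value."""
    return math.sqrt(idf(doc_freq, num_docs))


def term_contribution(tf_q, tf_d, idf_value):
    """One term's unnormalized contribution: (tf_q * idf) * (tf_d * idf)."""
    return (tf_q * idf_value) * (tf_d * idf_value)


doc_freq, num_docs = 9, 1000
tf_q, tf_d = 1.0, 2.0

# With the plain idf, the product carries idf^2.
squared = term_contribution(tf_q, tf_d, idf(doc_freq, num_docs))

# With sqrt(idf), the two factors multiply back to a single idf.
linear = term_contribution(tf_q, tf_d, sqrt_idf(doc_freq, num_docs))

print(squared, linear)
```

In a real index the same idea would go into the idf method of a Similarity subclass; the exact signature depends on the Lucene version in use.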

Chuck

  > -----Original Message-----
  > From: Vikas Gupta [mailto:[EMAIL PROTECTED]
  > Sent: Tuesday, December 14, 2004 9:32 PM
  > To: Lucene Users List
  > Subject: Re: A question about scoring function in Lucene
  > 
  > Lucene uses the vector space model. To understand that:
  > 
  > -Read section 2.1 of "Space optimizations for Total Ranking" paper
  > (Linked
  > here http://lucene.sourceforge.net/publications.html)
  > -Read section 6 to 6.4 of
  > http://www.csee.umbc.edu/cadip/readings/IR.report.120600.book.pdf
  > -Read section 1 of
  > http://www.cs.utexas.edu/users/inderjit/courses/dm2004/lecture5.ps
  > 
  > Vikas
  > 
  > On Tue, 14 Dec 2004, Nhan Nguyen Dang wrote:
  > 
  > > Hi all,
  > > Lucene scores a document based on the correlation between
  > > the query q and document d:
  > > (this is raw function, I don't pay attention to the
  > > boost_t, coord_q_d factor)
  > >
  > > score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t
  > > / norm_d_t)  (*)
  > >
  > > Could anybody explain it in detail ? Or are there any
  > > papers, documents about this function ? Because:
  > >
  > > I have also read the book: Modern Information
  > > Retrieval, author: Ricardo Baeza-Yates and Berthier
  > > Ribeiro-Neto, Addison Wesley (Hope you have read it
  > > too). On page 27, they also suggest a scoring function
  > > for vector model based on the correlation between
  > > query q and document d as follow (I use different
  > > symbol):
  > >
  > >            sum_t( weight_t_d * weight_t_q)
  > > score_d(d, q)=  --------------------------------- (**)
  > >                 norm_d * norm_q
  > >
  > > where weight_t_d = tf_d * idf_t
  > >       weight_t_q = tf_q * idf_t
  > >       norm_d = sqrt( sum_t( (tf_d * idf_t)^2 ) )
  > >       norm_q = sqrt( sum_t( (tf_q * idf_t)^2 ) )
  > >
  > > (**):          sum_t( tf_q*idf_t * tf_d*idf_t)
  > > score_d(d, q)=---------------------------------  (***)
  > >              norm_d * norm_q
  > >
  > > The two function, (*) and (***), have 2 differences:
  > > 1. in (***), the sum_t is just for the numerator but
  > > in the (*), the sum_t is for everything. So, with
  > > norm_q = sqrt(sum_t((tf_q*idf_t)^2)); sum_t is
  > > calculated twice. Is this right? please explain.
  > >
  > > 2. No factor that define norms of the document: norm_d
  > > in the function (*). Can you explain this. what is the
  > > role of factor norm_d_t ?
  > >
  > > One more question: could anybody give me documents,
  > > papers that explain this function in detail. so when I
  > > apply Lucene for my system, I can adapt the document,
  > > and the field so that I still receive the correct
  > > scoring information from Lucene .
  > >
  > > Best regard,
  > > Thanks every body,
  > >
  > > =====
  > > Đặng Nhân
  > 
  > ---------------------------------------------------------------------
  > To unsubscribe, e-mail: [EMAIL PROTECTED]
  > For additional commands, e-mail: [EMAIL PROTECTED]

