RE: A question about scoring function in Lucene

Nhan Nguyen Dang Wed, 15 Dec 2004 02:09:33 -0800

Thanks for your answer.
In Lucene's scoring function, only norm_q is used,
but for a given query, norm_q is the same for all
documents, so norm_q does not actually affect the ranking.
norm_d is different: each document has its own
norm_d, and it affects the score of document d for query q.
If you drop it, the score is no longer correct,
or it is no longer the vector space model. Could
you explain this a little more?
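
To make the point concrete, here is a minimal Python sketch. The weight
vectors and the helper names (dot, norm, rank) are made up for
illustration and are not Lucene code. It shows that dividing by norm_q
leaves the ranking unchanged, while dividing by norm_d can reorder the
documents.

```python
import math

# Toy tf*idf weight vectors (made-up numbers, not from a real index).
query = {"lucene": 1.0, "score": 1.0}
docs = {
    "short": {"lucene": 1.0, "score": 1.0},
    "long": {"lucene": 2.0, "score": 1.0, "java": 5.0, "index": 5.0},
}

def dot(q, d):
    # Inner product over the terms of the query.
    return sum(w * d.get(t, 0.0) for t, w in q.items())

def norm(v):
    # Euclidean length of a weight vector.
    return math.sqrt(sum(w * w for w in v.values()))

def rank(scores):
    # Document names ordered from best to worst score.
    return sorted(scores, key=scores.get, reverse=True)

# Unnormalized inner product.
raw = {name: dot(query, d) for name, d in docs.items()}
# Divided by norm_q only: every score is scaled by the same constant.
by_q = {name: s / norm(query) for name, s in raw.items()}
# Full cosine: also divided by norm_d, which differs per document.
cosine = {name: dot(query, d) / (norm(query) * norm(d))
          for name, d in docs.items()}

print(rank(raw), rank(by_q), rank(cosine))
```

With these toy numbers the long document wins on the raw and
norm_q-only scores, but the short document wins once norm_d is applied.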


I agree that it is expensive to compute with incremental
indexing, because when one document is added, the idf of
each term changes. But dropping it does not seem like a good choice.
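
A small sketch of why this is expensive, under stated assumptions: the
idf formula below (log(N/df) + 1) is just one common variant chosen for
illustration and may differ from what Lucene uses, and the tiny corpus
is invented. Adding a single document changes N and the document
frequencies, so norm_d changes for documents that were already indexed.

```python
import math
from collections import Counter

def idf(term, corpus):
    # Assumed idf variant: log(N / df) + 1 (illustrative only).
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) + 1.0

def norm_d(doc, corpus):
    # sqrt( sum_t (tf_d * idf_t)^2 ), as in the definition of norm_d.
    tf = Counter(doc)
    return math.sqrt(sum((tf[t] * idf(t, corpus)) ** 2 for t in tf))

corpus = [["lucene", "score"], ["lucene", "index"]]
before = [norm_d(d, corpus) for d in corpus]

# Incremental indexing: one new document arrives.
corpus.append(["score", "query"])
after = [norm_d(d, corpus) for d in corpus[:2]]

# Both previously indexed documents now have a different norm_d.
print(before, after)
```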

What is the role of norm_d_t?
Nhan.

--- Chuck Williams <[EMAIL PROTECTED]> wrote:

> Nhan,
> 
> Re.  your two differences:
> 
> 1 is not a difference.  Norm_d and Norm_q are both
> independent of t, so summing over t has no effect on
> them.  I.e., Norm_d * Norm_q is constant wrt the
> summation, so it doesn't matter if the sum is over
> just the numerator or over the entire fraction, the
> result is the same.
> 
> 2 is a difference.  Lucene uses Norm_q instead of
> Norm_d because Norm_d is too expensive to compute,
> especially in the presence of incremental indexing. 
> E.g., adding or deleting any document changes the
> idf's, so if Norm_d was used it would have to be
> recomputed for ALL documents.  This is not feasible.
> 
> Another point you did not mention is that the idf
> term is squared (in both of your formulas).  Salton,
> the originator of the vector space model, dropped
> one idf factor from his formula as it improved
> results empirically.  More recent theoretical
> justifications of tf*idf provide intuitive
> explanations of why idf should only be included
> linearly.  tf is best thought of as the real vector
> entry, while idf is a weighting term on the
> components of the inner product.  E.g., see the
> excellent paper by Robertson, "Understanding inverse
> document frequency: on theoretical arguments for
> IDF", available here: 
> http://www.emeraldinsight.com/rpsv/cgi-bin/emft.pl
> if you sign up for an eval.
> 
> It's easy to correct for idf^2 by using a custom
> Similarity that takes a final square root.
> 
> Chuck
> 
>   > -----Original Message-----
>   > From: Vikas Gupta [mailto:[EMAIL PROTECTED]
>   > Sent: Tuesday, December 14, 2004 9:32 PM
>   > To: Lucene Users List
>   > Subject: Re: A question about scoring function in Lucene
>   > 
>   > Lucene uses the vector space model. To understand that:
>   > 
>   > -Read section 2.1 of "Space optimizations for Total Ranking" paper
>   > (Linked here http://lucene.sourceforge.net/publications.html)
>   > -Read section 6 to 6.4 of
>   > http://www.csee.umbc.edu/cadip/readings/IR.report.120600.book.pdf
>   > -Read section 1 of
>   > http://www.cs.utexas.edu/users/inderjit/courses/dm2004/lecture5.ps
>   > 
>   > Vikas
>   > 
>   > On Tue, 14 Dec 2004, Nhan Nguyen Dang wrote:
>   > 
>   > > Hi all,
>   > > Lucene scores documents based on the correlation between
>   > > the query q and document d:
>   > > (this is the raw function; I ignore the boost_t and
>   > > coord_q_d factors)
>   > >
>   > > score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t )  (*)
>   > >
>   > > Could anybody explain it in detail? Or are there any
>   > > papers or documents about this function? Because:
>   > >
>   > > I have also read the book Modern Information
>   > > Retrieval by Ricardo Baeza-Yates and Berthier
>   > > Ribeiro-Neto, Addison Wesley (I hope you have read it
>   > > too). On page 27, they also suggest a scoring function
>   > > for the vector model based on the correlation between
>   > > query q and document d, as follows (I use different
>   > > symbols):
>   > >
>   > >                 sum_t( weight_t_d * weight_t_q )
>   > > score_d(d, q) = --------------------------------  (**)
>   > >                        norm_d * norm_q
>   > >
>   > > where weight_t_d = tf_d * idf_t
>   > >       weight_t_q = tf_q * idf_t
>   > >       norm_d = sqrt( sum_t( (tf_d * idf_t)^2 ) )
>   > >       norm_q = sqrt( sum_t( (tf_q * idf_t)^2 ) )
>   > >
>   > > (**) becomes:
>   > >
>   > >                 sum_t( tf_q*idf_t * tf_d*idf_t )
>   > > score_d(d, q) = --------------------------------  (***)
>   > >                        norm_d * norm_q
>   > >
>   > > The two functions, (*) and (***), have 2 differences:
>   > > 1. In (***), the sum_t is just over the numerator, but
>   > > in (*), the sum_t is over everything. So, with
>   > > norm_q = sqrt(sum_t((tf_q*idf_t)^2)), sum_t is
>   > > calculated twice. Is this right? Please explain.
>   > >
>   > > 2. There is no factor that defines the norm of the
>   > > document, norm_d, in function (*). Can you explain this?
>   > > What is the role of the factor norm_d_t?
>   > >
>   > > One more question: could anybody give me documents or
>   > > papers that explain this function in detail, so that when I
>   > > apply Lucene to my system, I can adapt the documents
>   > > and the fields so that I still receive the correct
>   > > scoring information from Lucene?
>   > >
>   > > Best regards,
>   > > Thanks everybody,
>   > >
>   > > =====
>   > > Đặng Nhân
>   > 
>   > ---------------------------------------------------------------------
>   > To unsubscribe, e-mail: [EMAIL PROTECTED]
>   > For additional commands, e-mail: [EMAIL PROTECTED]


=====
Đặng Nhân





                