I'm not sure why Lucene's standard scoring is not working for your duplicate
detection, but I'm not entirely surprised, since it's not designed
specifically for that.  Could you tell me more about your duplicate
criteria?  Maybe Lucene's query mechanism can be made to work...

Since you're effectively doing your own scoring, instead of making a
separate query, one for each term, then combining these results by hand, it
would be much faster to use IndexReader.termDocs() for each term, and merge
these lists of document numbers yourself.  To keep things fast, be sure not
to call IndexReader.document(int) in your inner loop!
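
A minimal sketch of that merge step, under stated assumptions: each int[]
below is a hypothetical stand-in for the document numbers you would collect
by iterating the TermDocs returned from IndexReader.termDocs() for one term,
and the class and method names are made up for illustration.  It only counts
how many terms each document number matches; no stored documents are loaded.

```java
import java.util.HashMap;
import java.util.Map;

class PostingsMerge {
    /**
     * For each document number, counts how many of the given per-term
     * lists contain it. Each int[] is a hypothetical stand-in for the
     * document numbers enumerated for one term.
     */
    static Map<Integer, Integer> countMatches(int[][] postings) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int[] docs : postings) {
            for (int doc : docs) {
                // Only document numbers are touched here; nothing is
                // fetched from the index inside the loop.
                counts.merge(doc, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Three terms, with the document numbers each one appears in.
        int[][] postings = { {0, 2, 5}, {2, 5, 7}, {5} };
        System.out.println(countMatches(postings));
    }
}
```

Documents with the highest counts are the duplicate candidates; only those
few would then need to be loaded and compared in detail.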

Doug

-----Original Message-----
From: Scott Ganyo [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, July 11, 2001 11:19 AM
To: 'Doug Cutting'; 'Alex Murzaku'; [EMAIL PROTECTED]
Subject: RE: [Lucene-users] score filter


Thanks for that, Doug!

While we're on the topic of scoring and filtering, maybe you would have some
advice on performing duplicate checking between Documents in the index.
Each of our Documents is basically a name and address record and we try to
report on the probability of a given record or records matching any of the
others in the database.

The developer that wrote our current implementation claimed that he could
only really get good, relevant scores by performing a series of Queries (one
per Term) for each document and then aggregating them by hand rather than
performing a single Query per Document to get the score.  This is what we
have in place now and it does seem to work, but all those Queries cause the
report to be *very* slow.  Personally, I keep thinking there must be a
better/faster way to get what we want but I haven't yet had time to delve
into the depths of the scorer.

Has anyone done anything like this?  Do you have any suggestions?

Thanks,
Scott

-----Original Message----- 
From: Doug Cutting [mailto:[EMAIL PROTECTED]] 
Sent: Wednesday, July 11, 2001 12:54 PM 
To: 'Alex Murzaku'; [EMAIL PROTECTED] 
Subject: RE: [Lucene-users] score filter 


> From: Doug Cutting [mailto:[EMAIL PROTECTED]] 
> 
> For the record, Lucene's scoring algorithm is, roughly: 
>   score_d = sum_t(tf_q*idf_t/norm_q * tf_d*idf_t/norm_d_t) 
> where: 
>   score_d : score for document d 
>   sum_t : sum for all terms t 
>   tf_q : the square root of the frequency of t in the query 
>   tf_d : the square root of the frequency of t in d 
>   idf_t : log(numDocs/(docFreq_t+1)) + 1.0
>   numDocs : number of documents in index 
>   docFreq_t : number of documents containing t 
>   norm_q : sqrt(sum_t((tf_q*idf_t)^2)) 
>   norm_d_t : square root of number of tokens in d in the same field as t
> 
> (I hope that's right!) 
Make that: 
  score_d = sum_t(tf_q*idf_t/norm_q * tf_d*idf_t/norm_d_t * boost_t) * coord_q_d
where 
  boost_t : the user-specified boost for term t 
  coord_q_d : number of terms in both query and document / number of terms in query
The coordination factor gives an AND-like boost to documents that contain,
e.g., all three terms in a three-word query over those that contain just two
of the words.
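
To make the formula concrete, here is a rough sketch that evaluates it for
one document.  It is an illustration under assumptions, not Lucene's own
code: the class and method names are hypothetical, all terms are taken to be
in a single field (so norm_d_t is the same for every term), and idf_t is
read as log(numDocs/(docFreq_t+1)) + 1.0, which is an assumption about the
intended grouping.

```java
class ScoreSketch {
    // idf_t, assuming the grouping log(numDocs / (docFreq_t + 1)) + 1.0
    static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    /**
     * Arrays are indexed by query term t:
     *   queryFreq[t]   frequency of t in the query
     *   docTermFreq[t] frequency of t in document d (0 if absent)
     *   docFreq[t]     number of documents containing t
     *   boost[t]       user-specified boost for t
     * fieldLength is the number of tokens in d's field (single field assumed).
     */
    static double score(int[] queryFreq, int[] docTermFreq, int[] docFreq,
                        double[] boost, int numDocs, int fieldLength) {
        int n = queryFreq.length;
        double[] idfs = new double[n];
        double sumSq = 0.0;
        for (int t = 0; t < n; t++) {
            idfs[t] = idf(numDocs, docFreq[t]);
            double w = Math.sqrt(queryFreq[t]) * idfs[t];
            sumSq += w * w;
        }
        double normQ = Math.sqrt(sumSq);        // norm_q
        double normD = Math.sqrt(fieldLength);  // norm_d_t

        double sum = 0.0;
        int matched = 0;
        for (int t = 0; t < n; t++) {
            if (docTermFreq[t] == 0) {
                continue; // term not in d, contributes nothing
            }
            matched++;
            double tfQ = Math.sqrt(queryFreq[t]);
            double tfD = Math.sqrt(docTermFreq[t]);
            sum += (tfQ * idfs[t] / normQ) * (tfD * idfs[t] / normD) * boost[t];
        }
        double coord = (double) matched / n;    // coord_q_d
        return sum * coord;
    }

    public static void main(String[] args) {
        // Two-term query; the document contains only the first term.
        double s = score(new int[] {1, 1}, new int[] {4, 0},
                         new int[] {9, 9}, new double[] {1.0, 1.0}, 10, 16);
        System.out.println(s);
    }
}
```

With the same inputs, a document containing both query terms scores higher
than one containing only the first, partly through the extra summand and
partly through coord_q_d rising from 1/2 to 1.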


Doug 
_______________________________________________ 
Lucene-users mailing list 
[EMAIL PROTECTED] 
http://lists.sourceforge.net/lists/listinfo/lucene-users 
