I'm not sure why Lucene's standard scoring is not working for your duplicate detection, but I'm not entirely surprised, since it's not designed specifically for that. Could you tell me more about your duplicate criteria? Maybe Lucene's query mechanism can be made to work...

Since you're effectively doing your own scoring, instead of making a separate query for each term and then combining the results by hand, it would be much faster to use IndexReader.termDocs() for each term and merge these lists of document numbers yourself. To keep things fast, be sure not to call IndexReader.document(int) in your inner loop!

Doug

-----Original Message-----
From: Scott Ganyo [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, July 11, 2001 11:19 AM
To: 'Doug Cutting'; 'Alex Murzaku'; [EMAIL PROTECTED]
Subject: RE: [Lucene-users] score filter

Thanks for that, Doug!

While we're on the topic of scoring and filtering, maybe you would have some advice on performing duplicate checking between Documents in the index. Each of our Documents is basically a name and address record, and we try to report on the probability of a given record or records matching any of the others in the database.

The developer who wrote our current implementation claimed that he could only really get good, relevant scores by performing a series of Queries (one per Term) for each document and then aggregating them by hand, rather than performing a single Query per Document to get the score. This is what we have in place now and it does seem to work, but all those Queries cause the report to be *very* slow. Personally, I keep thinking there must be a better/faster way to get what we want, but I haven't yet had time to delve into the depths of the scorer.

Has anyone done anything like this? Do you have any suggestions?

Thanks,
Scott

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, July 11, 2001 12:54 PM
To: 'Alex Murzaku'; [EMAIL PROTECTED]
Subject: RE: [Lucene-users] score filter

> From: Doug Cutting [mailto:[EMAIL PROTECTED]]
>
> For the record, Lucene's scoring algorithm is, roughly:
>   score_d = sum_t(tf_q*idf_t/norm_q * tf_d*idf_t/norm_d_t)
> where:
>   score_d   : score for document d
>   sum_t     : sum for all terms t
>   tf_q      : the square root of the frequency of t in the query
>   tf_d      : the square root of the frequency of t in d
>   idf_t     : log(numDocs/docFreq_t+1) + 1.0
>   numDocs   : number of documents in index
>   docFreq_t : number of documents containing t
>   norm_q    : sqrt(sum_t((tf_q*idf_t)^2))
>   norm_d_t  : square root of number of tokens in d in the same
>               field as t
>
> (I hope that's right!)

Make that:

  score_d = sum_t(tf_q*idf_t/norm_q * tf_d*idf_t/norm_d_t * boost_t) * coord_q_d

where

  boost_t   : the user-specified boost for term t
  coord_q_d : number of terms in both query and document / number of terms in query

The coordination factor gives an AND-like boost to documents that contain, e.g., all three terms in a three word query over those that contain just two of the words.

Doug

_______________________________________________
Lucene-users mailing list
[EMAIL PROTECTED]
http://lists.sourceforge.net/lists/listinfo/lucene-users
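
To make the corrected formula and the "merge the lists yourself" suggestion concrete, here is a minimal plain-Java sketch. It is not Lucene code. The class ScoreSketch, the Postings holder, and the docLengths array are hypothetical stand-ins: Postings plays the role of the (document number, frequency) pairs one would read from IndexReader.termDocs(), and docLengths assumes a single field. The idf term is read as log(numDocs / (docFreq_t + 1)) + 1.0, which is an assumption about the intended grouping in the formula above.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch of the scoring formula from this thread; not Lucene code. */
public class ScoreSketch {

    /** Hypothetical stand-in for one term's postings: parallel doc numbers and frequencies. */
    public static final class Postings {
        public final int[] docs;
        public final int[] freqs;
        public Postings(int[] docs, int[] freqs) {
            this.docs = docs;
            this.freqs = freqs;
        }
    }

    /** idf_t, assuming the grouping log(numDocs / (docFreq_t + 1)) + 1.0. */
    public static double idf(int numDocs, int docFreq) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    /**
     * Scores every document that contains at least one query term.
     * queryFreqs: term -> frequency in the query; boosts: term -> boost (1.0 if absent);
     * docLengths[d]: number of tokens in document d (single field assumed).
     */
    public static Map<Integer, Double> score(Map<String, Integer> queryFreqs,
                                             Map<String, Postings> postings,
                                             int[] docLengths,
                                             Map<String, Double> boosts) {
        int numDocs = docLengths.length;

        // norm_q = sqrt(sum_t((tf_q * idf_t)^2))
        double sumSquares = 0.0;
        for (Map.Entry<String, Integer> e : queryFreqs.entrySet()) {
            Postings p = postings.get(e.getKey());
            int docFreq = (p == null) ? 0 : p.docs.length;
            double w = Math.sqrt(e.getValue()) * idf(numDocs, docFreq);
            sumSquares += w * w;
        }
        double normQ = Math.sqrt(sumSquares);

        // Walk each term's postings once, accumulating per-document partial sums.
        // No stored documents are loaded here, only doc numbers and frequencies.
        Map<Integer, Double> sums = new HashMap<>();
        Map<Integer, Integer> matchedTerms = new HashMap<>();
        for (Map.Entry<String, Integer> e : queryFreqs.entrySet()) {
            Postings p = postings.get(e.getKey());
            if (p == null) {
                continue; // term does not occur in the index
            }
            double idfT = idf(numDocs, p.docs.length);
            double tfQ = Math.sqrt(e.getValue());
            double boost = boosts.getOrDefault(e.getKey(), 1.0);
            for (int i = 0; i < p.docs.length; i++) {
                int d = p.docs[i];
                double tfD = Math.sqrt(p.freqs[i]);
                double normD = Math.sqrt(docLengths[d]);
                double part = tfQ * idfT / normQ * tfD * idfT / normD * boost;
                sums.merge(d, part, Double::sum);
                matchedTerms.merge(d, 1, Integer::sum);
            }
        }

        // coord_q_d = terms in both query and document / terms in query
        Map<Integer, Double> scores = new HashMap<>();
        for (Map.Entry<Integer, Double> e : sums.entrySet()) {
            double coord = matchedTerms.get(e.getKey()) / (double) queryFreqs.size();
            scores.put(e.getKey(), e.getValue() * coord);
        }
        return scores;
    }

    public static void main(String[] args) {
        Map<String, Integer> query = new HashMap<>();
        query.put("smith", 1);
        query.put("main", 1);
        Map<String, Postings> postings = new HashMap<>();
        postings.put("smith", new Postings(new int[] {0, 1}, new int[] {1, 1}));
        postings.put("main", new Postings(new int[] {1}, new int[] {4}));
        int[] docLengths = {4, 9, 1};
        System.out.println(score(query, postings, docLengths, new HashMap<>()));
    }
}
```

For a duplicate report, the query terms would be the terms of the record being checked, and the resulting map could be thresholded to flag likely matches. Against a real index, the inner loop would iterate a TermDocs enumeration rather than arrays.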