On 3/29/07, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
Hm, removing duplicates (as determined by a value of a specified document field) from the results would be nice. How would your addition affect performance, considering it has to check the PQ for a previous value for every candidate hit?
We're doing this in solr (slightly hacked up), based on a field. The penalty was fairly substantial, if you just look at the performance of the queue itself, but the overall performance isn't bad. (Probably because solr's fast, and caches nicely). We just build a FieldSortedHitQueue equivalent, based on the standard java.util.PriorityQueue, instead of org.apache.lucene.util.PriorityQueue. I'm sure it could be done in a more efficient fashion, but since performance has been acceptable, I haven't bothered to try to improve it. Tom Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Search - Share ----- Original Message ---- From: Peter Keegan <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Thursday, March 29, 2007 9:39:13 AM Subject: FieldSortedHitQueue enhancement This is request for an enhancement to FieldSortedHitQueue/PriorityQueue that would prevent duplicate documents from being inserted, or alternatively, allow the application to prevent this (reason explained below). I can do this today by making the 'lessThan' method public and checking the queue before inserting like this: if (hq.size() < maxSize) { // doc will be inserted into queue - check for duplicate before inserting } else if (hq.size() > 0 && !hq.lessThan((ScoreDoc)fieldDoc, (ScoreDoc)hq.top()) { // doc will be inserted into queue - check for duplicate before inserting } else { // doc will not be inserted - no check needed } However, this is just replicating existing code in PriorityQueue->insert(). An alternative would be to have a method like: public boolean wouldBeInserted(ScoreDoc doc) // returns true if doc would be inserted, without inserting The reason for this is that I have some queries that get expanded into multiple searches and the resulting hits are OR'd together. The queries contain 'terms' that are not seen by Lucene but are handled by a HitCollector that uses external data for each document to evaluate hits. The results from the priority queue should contain no duplicate documents (first or last doc wins). Do any of these suggestions seem reasonable?. So far, I've been able to use Lucene without any modifications, and hope to continue this way. Peter --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]