Peter, how did you achieve 'last wins', as you must presumably remove first
from the PQ?
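Neither message shows the remove() in question, but one plausible sketch of a 'last wins' replace is a linear-scan remove followed by a reinsert. This illustration uses the JDK's `java.util.PriorityQueue` (whose `remove(Object)` already does a linear scan via `equals()`) rather than Lucene's `PriorityQueue`, which has no remove; the `ReplacingQueue` and `Hit` names are hypothetical:

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical sketch, not Lucene code: 'last wins' by removing any hit with
// the same external ID (a linear search via Hit.equals) and reinserting.
class ReplacingQueue {
    static final class Hit {
        final String id;   // external ID that duplicates are keyed on
        final float score;
        Hit(String id, float score) { this.id = id; this.score = score; }
        // Equality on the ID only, so PriorityQueue.remove finds the duplicate.
        @Override public boolean equals(Object o) {
            return o instanceof Hit && ((Hit) o).id.equals(id);
        }
        @Override public int hashCode() { return id.hashCode(); }
    }

    // Min-heap by score: the head is the least competitive hit.
    final PriorityQueue<Hit> pq =
        new PriorityQueue<Hit>(Comparator.comparingDouble((Hit h) -> h.score));

    // Remove the previous entry for this ID, if any (O(n) scan), then insert.
    void replace(Hit h) {
        pq.remove(h);
        pq.add(h);
    }
}
```

The O(n) scan is the cost Antony alludes to below; for a queue of 25 to a few hundred entries it is likely cheap relative to the search itself.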
I implemented 'first wins' because the score is less important than other
fields (distance, in our case), but you make a good point, since the score
may be more important. How did you implement remove()?

Peter

On 3/29/07, Antony Bowesman <[EMAIL PROTECTED]> wrote:
I've got a similar duplicate case, but my duplicates are based on an external
ID rather than a doc ID, so they occur within a single query. It's using a
custom HitCollector, but score based, not field sorted. If my duplicate has a
higher score than one already on the PQ, I need to update the stored score
with the higher one, so the PQ needs a replace() method where the stored
object's equals() can be used to find the object to delete. I'm not sure
there's a way to find the object efficiently in this case other than a linear
search. I implemented remove(). Peter, how did you achieve 'last wins', as you
must presumably remove first from the PQ?

Antony

Peter Keegan wrote:
> The duplicate check would just be on the doc ID. I'm using a TreeSet to
> detect duplicates, with no noticeable effect on performance. The PQ only
> has to be checked for a previous value if and only if the element about to
> be inserted is actually inserted, and not dropped because it's less than
> the least value already in there. So, the TreeSet is never bigger than the
> size of the PQ (typically 25 to a few hundred items), not the size of all
> hits.
>
> Peter
>
> On 3/29/07, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
>>
>> Hm, removing duplicates (as determined by the value of a specified
>> document field) from the results would be nice.
>> How would your addition affect performance, considering it has to check
>> the PQ for a previous value for every candidate hit?
>>
>> Otis
>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>> Simpy -- http://www.simpy.com/ - Tag - Search - Share
>>
>> ----- Original Message ----
>> From: Peter Keegan <[EMAIL PROTECTED]>
>> To: java-user@lucene.apache.org
>> Sent: Thursday, March 29, 2007 9:39:13 AM
>> Subject: FieldSortedHitQueue enhancement
>>
>> This is a request for an enhancement to FieldSortedHitQueue/PriorityQueue
>> that would prevent duplicate documents from being inserted, or
>> alternatively, allow the application to prevent this (reason explained
>> below).
>> I can do this today by making the 'lessThan' method public and checking
>> the queue before inserting, like this:
>>
>> if (hq.size() < maxSize) {
>>     // doc will be inserted into queue - check for duplicate before inserting
>> } else if (hq.size() > 0 && !hq.lessThan((ScoreDoc)fieldDoc, (ScoreDoc)hq.top())) {
>>     // doc will be inserted into queue - check for duplicate before inserting
>> } else {
>>     // doc will not be inserted - no check needed
>> }
>>
>> However, this is just replicating existing code in PriorityQueue.insert().
>> An alternative would be to have a method like:
>>
>> public boolean wouldBeInserted(ScoreDoc doc)
>> // returns true if doc would be inserted, without inserting
>>
>> The reason for this is that I have some queries that get expanded into
>> multiple searches whose resulting hits are OR'd together. The queries
>> contain 'terms' that are not seen by Lucene but are handled by a
>> HitCollector that uses external data for each document to evaluate hits.
>> The results from the priority queue should contain no duplicate documents
>> (first or last doc wins).
>>
>> Do any of these suggestions seem reasonable? So far, I've been able to
>> use Lucene without any modifications, and I hope to continue this way.
>>
>> Peter
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
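The `wouldBeInserted()` predicate proposed in the quoted message can be sketched against a plain JDK min-queue. This is a standalone illustration, not Lucene's actual PriorityQueue: `BoundedScoreQueue` and its bare float scores are hypothetical stand-ins, but the acceptance test mirrors the room-or-beats-the-least logic in the quoted snippet:

```java
import java.util.PriorityQueue;

// Hypothetical sketch of the proposed predicate: report whether a score
// would make it into a bounded queue, without actually inserting it.
class BoundedScoreQueue {
    private final int maxSize;
    private final PriorityQueue<Float> pq = new PriorityQueue<>(); // min-heap

    BoundedScoreQueue(int maxSize) { this.maxSize = maxSize; }

    // The same acceptance test insert() applies, factored out so a caller
    // can run its duplicate check only for hits that would actually be kept.
    boolean wouldBeInserted(float score) {
        return pq.size() < maxSize || score > pq.peek();
    }

    void insert(float score) {
        if (wouldBeInserted(score)) {
            if (pq.size() == maxSize) pq.poll(); // evict the current least score
            pq.add(score);
        }
    }

    float top() { return pq.peek(); }
}
```

Because `insert()` itself calls `wouldBeInserted()`, the two can never disagree, which is exactly the de-duplication of logic the message asks for.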
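Peter's TreeSet scheme, described in his quoted reply above, might look roughly like this. It is a sketch with made-up names (`DedupingHitQueue` and `SimpleScoreDoc` are local stand-ins, not Lucene classes): the doc-ID set mirrors the queue's contents, so it stays bounded by the queue size rather than the total hit count, and the duplicate check runs only for hits that would actually be kept:

```java
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.TreeSet;

// Stand-in for Lucene's ScoreDoc; not the real class.
class SimpleScoreDoc {
    final int doc;
    final float score;
    SimpleScoreDoc(int doc, float score) { this.doc = doc; this.score = score; }
}

// Hypothetical sketch of the described scheme: a bounded score queue paired
// with a TreeSet of doc IDs that never grows past the queue's size.
class DedupingHitQueue {
    private final int maxSize;
    // Min-heap by score: the head is the least competitive hit in the queue.
    private final PriorityQueue<SimpleScoreDoc> pq =
        new PriorityQueue<SimpleScoreDoc>(
            Comparator.comparingDouble((SimpleScoreDoc sd) -> sd.score));
    // Doc IDs currently in the queue; never larger than maxSize.
    private final TreeSet<Integer> docsInQueue = new TreeSet<Integer>();

    DedupingHitQueue(int maxSize) { this.maxSize = maxSize; }

    // 'First wins': a later hit whose doc ID is already queued is dropped.
    boolean insert(SimpleScoreDoc sd) {
        if (pq.size() < maxSize) {
            if (!docsInQueue.add(sd.doc)) return false; // duplicate: keep first
            pq.add(sd);
            return true;
        }
        if (sd.score <= pq.peek().score) return false;  // not competitive: no check
        if (!docsInQueue.add(sd.doc)) return false;     // duplicate: keep first
        docsInQueue.remove(pq.poll().doc);              // evict the least hit
        pq.add(sd);
        return true;
    }

    int size() { return pq.size(); }
}
```

A 'last wins' variant would instead remove the queued entry with the matching doc ID before inserting, which is where the linear-search remove() discussed above comes in.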