I've got a similar duplicate case, but my duplicates are based on an
external ID rather than the doc ID, so they occur within a single query.
It's using a custom HitCollector, but score-based, not field-sorted.
If my duplicate has a higher score than the one already in the PQ, I need
to update the stored score with the higher one, so the PQ needs a
replace() method where the stored object's equals() can be used to find
the object to delete. I'm not sure there's a way to find the object
efficiently in this case other than a linear search. I implemented
remove().
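For illustration, here's a minimal standalone sketch of that
replace-by-equals idea: a hand-rolled score-ordered min-heap, not Lucene's
PriorityQueue, with a linear search to find the duplicate (ScoredHit and
its externalId field are made up for the example):

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical hit keyed by an external ID (not a Lucene class). */
class ScoredHit {
    final String externalId;
    float score;
    ScoredHit(String externalId, float score) {
        this.externalId = externalId;
        this.score = score;
    }
}

/** Minimal min-heap sketch with linear-search replace, for illustration. */
class DedupHeap {
    private final List<ScoredHit> heap = new ArrayList<>();
    private final int maxSize;

    DedupHeap(int maxSize) { this.maxSize = maxSize; }

    int size() { return heap.size(); }
    float topScore() { return heap.get(0).score; }

    /** Linear search for a hit with the same external ID (O(n)). */
    private int indexOf(String externalId) {
        for (int i = 0; i < heap.size(); i++) {
            if (heap.get(i).externalId.equals(externalId)) return i;
        }
        return -1;
    }

    /**
     * Insert a hit, replacing any duplicate (same external ID) when the
     * new score is higher. Returns true if the queue changed.
     */
    boolean insertOrReplace(ScoredHit hit) {
        int i = indexOf(hit.externalId);
        if (i >= 0) {                       // duplicate already stored
            if (hit.score <= heap.get(i).score) return false;
            heap.get(i).score = hit.score;  // keep the higher score
            siftDown(i);                    // restore heap order
            return true;
        }
        if (heap.size() < maxSize) {
            heap.add(hit);
            siftUp(heap.size() - 1);
            return true;
        }
        if (hit.score <= topScore()) return false; // worse than the least
        heap.set(0, hit);                          // evict the least hit
        siftDown(0);
        return true;
    }

    private void siftUp(int i) {
        while (i > 0) {
            int parent = (i - 1) / 2;
            if (heap.get(i).score >= heap.get(parent).score) break;
            swap(i, parent);
            i = parent;
        }
    }

    private void siftDown(int i) {
        int n = heap.size();
        while (true) {
            int l = 2 * i + 1, r = l + 1, min = i;
            if (l < n && heap.get(l).score < heap.get(min).score) min = l;
            if (r < n && heap.get(r).score < heap.get(min).score) min = r;
            if (min == i) break;
            swap(i, min);
            i = min;
        }
    }

    private void swap(int i, int j) {
        ScoredHit t = heap.get(i);
        heap.set(i, heap.get(j));
        heap.set(j, t);
    }
}
```

The linear scan is the cost Antony mentions; with PQ sizes of a few
hundred items it's usually cheap relative to the search itself.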
Peter, how did you achieve 'last wins'? Presumably you must first remove
the earlier entry from the PQ?
Antony
Peter Keegan wrote:
> The duplicate check would just be on the doc ID. I'm using a TreeSet to
> detect duplicates with no noticeable effect on performance. The PQ only
> has to be checked for a previous value IFF the element about to be
> inserted is actually inserted, and not dropped because it's less than
> the least value already in there. So, the TreeSet is never bigger than
> the size of the PQ (typically 25 to a few hundred items), not the size
> of all hits.
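As a standalone sketch of Peter's scheme (using java.util.PriorityQueue
and TreeSet rather than Lucene's classes; all names here are made up),
the key point is that the set is kept in lockstep with the queue, so it
never outgrows maxSize:

```java
import java.util.PriorityQueue;
import java.util.TreeSet;

/**
 * Sketch of a 'first doc wins' collector: a TreeSet of doc IDs mirrors
 * the PQ's contents, and the dup check runs only for hits that would
 * actually enter the queue. Not Lucene code; for illustration only.
 */
class TopDocsDedup {
    static class Hit {
        final int docId;
        final float score;
        Hit(int docId, float score) { this.docId = docId; this.score = score; }
    }

    private final int maxSize;
    // min-heap on score: the least hit sits at the head, ready to be evicted
    private final PriorityQueue<Hit> pq =
        new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));
    private final TreeSet<Integer> docsInQueue = new TreeSet<>();

    TopDocsDedup(int maxSize) { this.maxSize = maxSize; }

    /** Returns true if the hit was inserted. */
    boolean collect(Hit hit) {
        // Only hits that would actually enter the queue pay the dup check.
        boolean wouldInsert =
            pq.size() < maxSize || hit.score > pq.peek().score;
        if (!wouldInsert) return false;
        if (!docsInQueue.add(hit.docId)) return false; // duplicate: first wins
        if (pq.size() == maxSize) {
            docsInQueue.remove(pq.poll().docId); // evict the least hit
        }
        pq.add(hit);
        return true;
    }

    int size() { return pq.size(); }
    boolean contains(int docId) { return docsInQueue.contains(docId); }
}
```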
>
> Peter
>
> On 3/29/07, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
>>
>> Hm, removing duplicates (as determined by the value of a specified
>> document field) from the results would be nice.
>> How would your addition affect performance, considering it has to
>> check the PQ for a previous value for every candidate hit?
>>
>> Otis
>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>> Simpy -- http://www.simpy.com/ - Tag - Search - Share
>>
>> ----- Original Message ----
>> From: Peter Keegan <[EMAIL PROTECTED]>
>> To: java-user@lucene.apache.org
>> Sent: Thursday, March 29, 2007 9:39:13 AM
>> Subject: FieldSortedHitQueue enhancement
>>
>> This is a request for an enhancement to
>> FieldSortedHitQueue/PriorityQueue that would prevent duplicate
>> documents from being inserted, or, alternatively, would allow the
>> application to prevent this (reason explained below). I can do this
>> today by making the 'lessThan' method public and checking the queue
>> before inserting, like this:
>>
>> if (hq.size() < maxSize) {
>>     // doc will be inserted into queue - check for duplicate before inserting
>> } else if (hq.size() > 0 && !hq.lessThan((ScoreDoc)fieldDoc, (ScoreDoc)hq.top())) {
>>     // doc will be inserted into queue - check for duplicate before inserting
>> } else {
>>     // doc will not be inserted - no check needed
>> }
>>
>> However, this is just replicating existing code in
>> PriorityQueue.insert(). An alternative would be to have a method like:
>>
>> public boolean wouldBeInserted(ScoreDoc doc)
>> // returns true if doc would be inserted, without inserting it
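A standalone sketch of what such a wouldBeInserted() could look like,
using java.util.PriorityQueue in place of Lucene's queue (only the method
name comes from the proposal; the rest is made up for illustration). The
check mirrors the two 'will be inserted' branches from the snippet above:

```java
import java.util.PriorityQueue;

/** Illustrative queue exposing a side-effect-free insertion probe. */
class InsertProbe {
    private final PriorityQueue<Float> pq = new PriorityQueue<>(); // min-heap
    private final int maxSize;

    InsertProbe(int maxSize) { this.maxSize = maxSize; }

    /** Returns true if score would be inserted, without inserting it. */
    boolean wouldBeInserted(float score) {
        if (pq.size() < maxSize) return true;   // room left: always inserted
        return score >= pq.peek();              // i.e. !lessThan(score, top)
    }

    /** Insert only when the probe says so, evicting the least score. */
    boolean insert(float score) {
        if (!wouldBeInserted(score)) return false;
        if (pq.size() == maxSize) pq.poll();    // drop the least value
        pq.add(score);
        return true;
    }
}
```

A collector would call wouldBeInserted() first, run its duplicate check
only on the true case, and then insert.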
>>
>> The reason for this is that I have some queries that get expanded into
>> multiple searches, and the resulting hits are OR'd together. The
>> queries contain 'terms' that are not seen by Lucene but are handled by
>> a HitCollector that uses external data for each document to evaluate
>> hits. The results from the priority queue should contain no duplicate
>> documents (first or last doc wins).
>>
>> Do any of these suggestions seem reasonable? So far, I've been able to
>> use Lucene without any modifications, and I hope to continue this way.
>>
>> Peter