Peter, how did you achieve 'last wins' as you must presumably remove first
from the PQ?

I implemented 'first wins' because the score is less important than other
fields (distance, in our case), but you make a good point since score may be
more important. How did you implement remove()?

Peter


On 3/29/07, Antony Bowesman <[EMAIL PROTECTED]> wrote:

I've got a similar duplicate case, but my duplicates are keyed on an external
ID rather than the doc ID, so they occur within a single Query. It uses a
custom HitCollector and is score based, not field sorted.

If a duplicate has a higher score than the one already on the PQ, I need to
update the stored score with the higher one, so the PQ needs a replace()
method where the stored object's equals() can be used to find the object to
delete. I'm not sure there's a way to find the object efficiently in this
case other than a linear search. I implemented remove().
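The linear-search replace described above could be sketched roughly like this. This is a standalone illustration using java.util.PriorityQueue rather than Lucene's PriorityQueue, and the Hit class and replaceIfHigher name are hypothetical; it just shows the O(n) scan plus remove/re-add (mutating the score in place would corrupt the heap ordering, so remove-then-add is the safe path):

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical hit record keyed by an external ID, ordered by score.
class Hit {
    final String externalId;
    final float score;
    Hit(String externalId, float score) { this.externalId = externalId; this.score = score; }
}

public class ReplaceSketch {
    // Replace an equal element (matched by external ID) if the new score is
    // higher. Linear search: O(n) over the queue, tolerable for small queues
    // (a few dozen to a few hundred entries).
    static boolean replaceIfHigher(PriorityQueue<Hit> pq, Hit candidate) {
        for (Hit h : pq) {
            if (h.externalId.equals(candidate.externalId)) {
                if (candidate.score > h.score) {
                    pq.remove(h);      // O(n) removal
                    pq.add(candidate); // re-heapify with the higher score
                    return true;
                }
                return false;          // duplicate with lower/equal score: drop
            }
        }
        pq.add(candidate);             // no duplicate found: plain insert
        return true;
    }

    public static void main(String[] args) {
        // Min-heap: least score on top, as in Lucene's hit queues.
        PriorityQueue<Hit> pq =
            new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));
        replaceIfHigher(pq, new Hit("A", 1.0f));
        replaceIfHigher(pq, new Hit("B", 2.0f));
        replaceIfHigher(pq, new Hit("A", 3.0f)); // duplicate with higher score replaces
        System.out.println(pq.size());           // prints 2
        System.out.println(pq.peek().score);     // prints 2.0
    }
}
```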

Peter, how did you achieve 'last wins', as you must presumably remove first
from the PQ?

Antony


Peter Keegan wrote:
> The duplicate check would just be on the doc ID. I'm using a TreeSet to
> detect duplicates with no noticeable effect on performance. The PQ only
> has to be checked for a previous value IFF the element about to be
> inserted is actually inserted and not dropped because it's less than the
> least value already in there. So, the TreeSet is never bigger than the
> size of the PQ (typically 25 to a few hundred items), not the size of
> all hits.
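The TreeSet-mirrors-the-PQ scheme described above could look roughly like this. A standalone sketch (not Lucene's FieldSortedHitQueue; the DedupQueue and ScoredDoc names are hypothetical), implementing first-wins and evicting from the set whenever the queue evicts, so the set never grows past the queue size:

```java
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.TreeSet;

// Bounded min-heap of (docId, score) plus a TreeSet of docIds mirroring its
// contents; the set lookup happens only when the candidate would actually be
// inserted, as described in the message above.
public class DedupQueue {
    static class ScoredDoc {
        final int docId; final float score;
        ScoredDoc(int docId, float score) { this.docId = docId; this.score = score; }
    }

    final int maxSize;
    final PriorityQueue<ScoredDoc> pq =
        new PriorityQueue<>(Comparator.comparingDouble((ScoredDoc d) -> d.score));
    final TreeSet<Integer> seen = new TreeSet<>();

    DedupQueue(int maxSize) { this.maxSize = maxSize; }

    // First-wins: a duplicate docId is dropped even if its score is higher.
    boolean insert(ScoredDoc d) {
        boolean wouldInsert = pq.size() < maxSize || d.score > pq.peek().score;
        if (!wouldInsert) return false;       // cheap rejection, no set lookup
        if (!seen.add(d.docId)) return false; // duplicate already in the queue
        if (pq.size() == maxSize) {
            seen.remove(pq.poll().docId);     // evict the least, keep set in sync
        }
        pq.add(d);
        return true;
    }

    public static void main(String[] args) {
        DedupQueue q = new DedupQueue(2);
        q.insert(new ScoredDoc(1, 1.0f));
        q.insert(new ScoredDoc(2, 2.0f));
        q.insert(new ScoredDoc(1, 5.0f)); // duplicate: dropped (first wins)
        q.insert(new ScoredDoc(3, 3.0f)); // evicts doc 1 (score 1.0)
        System.out.println(q.pq.size() + " " + q.seen); // prints 2 [2, 3]
    }
}
```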
>
> Peter
>
> On 3/29/07, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
>>
>> Hm, removing duplicates (as determined by the value of a specified
>> document field) from the results would be nice.
>> How would your addition affect performance, considering it has to check
>> the PQ for a previous value for every candidate hit?
>>
>> Otis
>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>> Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
>>
>> ----- Original Message ----
>> From: Peter Keegan <[EMAIL PROTECTED]>
>> To: java-user@lucene.apache.org
>> Sent: Thursday, March 29, 2007 9:39:13 AM
>> Subject: FieldSortedHitQueue enhancement
>>
>> This is a request for an enhancement to FieldSortedHitQueue/PriorityQueue
>> that would prevent duplicate documents from being inserted, or
>> alternatively, allow the application to prevent this (reason explained
>> below). I can do this today by making the 'lessThan' method public and
>> checking the queue before inserting, like this:
>>
>> if (hq.size() < maxSize) {
>>     // doc will be inserted into queue - check for duplicate before inserting
>> } else if (hq.size() > 0 && !hq.lessThan((ScoreDoc)fieldDoc, (ScoreDoc)hq.top())) {
>>     // doc will be inserted into queue - check for duplicate before inserting
>> } else {
>>     // doc will not be inserted - no check needed
>> }
>>
>> However, this is just replicating existing code in
>> PriorityQueue->insert().
>> An alternative would be to have a method like:
>>
>> public boolean wouldBeInserted(ScoreDoc doc)
>> // returns true if doc would be inserted, without inserting
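The proposed non-mutating check could be sketched as below. Again a standalone illustration against java.util.PriorityQueue rather than Lucene's class (the wouldBeInserted signature here is an assumption, not the actual proposal's API); it factors out the same test insert() already performs, so the caller can run its duplicate check only when needed:

```java
import java.util.PriorityQueue;

public class WouldBeInserted {
    // Non-mutating version of the bounded-queue admission test: true if the
    // candidate would be inserted, without actually inserting it.
    static boolean wouldBeInserted(PriorityQueue<Float> pq, int maxSize, float score) {
        // Admitted if there is still room, or if it beats the least element on top.
        return pq.size() < maxSize || score > pq.peek();
    }

    public static void main(String[] args) {
        PriorityQueue<Float> pq = new PriorityQueue<>(); // min-heap
        int maxSize = 2;
        pq.add(1.0f);
        pq.add(2.0f);
        System.out.println(wouldBeInserted(pq, maxSize, 0.5f)); // prints false: below the least
        System.out.println(wouldBeInserted(pq, maxSize, 3.0f)); // prints true: beats the least
    }
}
```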
>>
>> The reason for this is that I have some queries that get expanded into
>> multiple searches, and the resulting hits are OR'd together. The queries
>> contain 'terms' that are not seen by Lucene but are handled by a
>> HitCollector that uses external data for each document to evaluate hits.
>> The results from the priority queue should contain no duplicate
>> documents (first or last doc wins).
>>
>> Do any of these suggestions seem reasonable? So far, I've been able to
>> use Lucene without any modifications, and I hope to continue this way.
>>
>> Peter
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>
>



