Re: FieldSortedHitQueue enhancement

Antony Bowesman Thu, 29 Mar 2007 17:31:53 -0800

Peter Keegan wrote:

I implemented 'first wins' because the score is less important than other
fields (distance, in our case), but you make a good point since scoremay be
more important. How did you implement remove()?


I've got my own PriorityQueue

    public boolean remove(E o)
    {
        if (o == null)
            return false;

        for (int i = 1; i <= size; i++)
        {
            if (queue[i] == o)
            {
                removeElement(i);
                return true;
            }
        }
        return false;
    }

I've got a reference to the original object, so I'm using == to locate it. I'venot used equals() as I've not yet worked out whether that will cause me anyproblems with hashing.


Antony


Peter


On 3/29/07, Antony Bowesman <[EMAIL PROTECTED]> wrote:


I've got a similar duplicate case, but my duplicates are based on an
external ID
rather than Doc id so occurs for a single Query.  It's using a custom
HitCollector but score based, not field sorted.

If my duplicate contains a higher score than one on the PQ I need to
update the

stored score with the higher one, so PQ needs a replace() method wherethe

stored object.equals() can be used to find the object to delete.  I'm not
sure
if there's a way to find the object efficiently in this case other than a
linear
search.  I implemented remove().

Peter, how did you achieve 'last wins' as you must presumably removefirst

from
the PQ?

Antony


Peter Keegan wrote:
> The duplicate check would just be on the doc ID. I'm using TreeSet to
> detect
> duplicates with no noticeable affect on performance. The PQ only has to
be
> checked for a previous value IFF the element about to be inserted is

> actually inserted and not dropped because it's less than the leastvalue

> already in there. So, the TreeSet is never bigger than the size of the
PQ
> (typically 25 to a few hundred items), not the size of all hits.
>
> Peter
>
> On 3/29/07, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
>>
>> Hm, removing duplicates (as determined by a value of a specified
document
>> field) from the results would be nice.

>> How would your addition affect performance, considering it has tocheck

>> the PQ for a previous value for every candidate hit?
>>
>> Otis
>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>> Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
>>
>> ----- Original Message ----
>> From: Peter Keegan <[EMAIL PROTECTED]>
>> To: java-user@lucene.apache.org
>> Sent: Thursday, March 29, 2007 9:39:13 AM
>> Subject: FieldSortedHitQueue enhancement
>>

>> This is request for an enhancement toFieldSortedHitQueue/PriorityQueue

>> that
>> would prevent duplicate documents from being inserted, or
alternatively,
>> allow the application to prevent this (reason explained below). I can
do
>> this today by making the 'lessThan' method public and checking the
queue
>> before inserting like this:
>>
>> if (hq.size() < maxSize) {
>>    // doc will be inserted into queue - check for duplicate before
>> inserting
>> } else if (hq.size() > 0 && !hq.lessThan((ScoreDoc)fieldDoc,
>> (ScoreDoc)hq.top()) {
>>   // doc will be inserted into queue - check for duplicate before
>> inserting
>> } else {
>>   // doc will not be inserted - no check needed
>> }
>>
>> However, this is just replicating existing code in
>> PriorityQueue->insert().
>> An alternative would be to have a method like:
>>
>> public boolean wouldBeInserted(ScoreDoc doc)
>> // returns true if doc would be inserted, without inserting
>>
>> The reason for this is that I have some queries that get expanded into

>> multiple searches and the resulting hits are OR'd together. Thequeries

>> contain 'terms' that are not seen by Lucene but are handled by a
>> HitCollector that uses external data for each document to evaluate
hits.
>> The
>> results from the priority queue should contain no duplicate documents
>> (first
>> or last doc wins).
>>

>> Do any of these suggestions seem reasonable?. So far, I've beenable to

>> use
>> Lucene without any modifications, and hope to continue this way.
>>
>> Peter



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: FieldSortedHitQueue enhancement

Reply via email to