I've got a similar duplicate case, but my duplicates are based on an external ID
rather than the doc ID, so they can occur within a single query. I'm using a
custom HitCollector, but score-based, not field-sorted.
If a duplicate has a higher score than the one already on the PQ, I need to
update the stored score with the higher one, so the PQ needs a replace() method
where the stored object's equals() can be used to find the object to delete.
I'm not sure there's a way to find the object efficiently in this case other
than a linear search. I implemented remove().
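For what it's worth, java.util.PriorityQueue (not Lucene's PriorityQueue) takes this same approach: its remove(Object) is a linear equals()-based search. A minimal sketch of a replace-if-higher operation along those lines (the Hit class and replaceIfBetter name are illustrative, not Lucene API):

```java
import java.util.PriorityQueue;

// Illustrative hit type: equality by external ID, ordering by score.
class Hit {
    final String externalId;
    final float score;
    Hit(String externalId, float score) { this.externalId = externalId; this.score = score; }
    @Override public boolean equals(Object o) {
        return o instanceof Hit && ((Hit) o).externalId.equals(externalId);
    }
    @Override public int hashCode() { return externalId.hashCode(); }
}

class ReplaceDemo {
    // Replace an equal hit only if the new score is higher; otherwise insert.
    // The scan below is the linear equals()-based search discussed above.
    static void replaceIfBetter(PriorityQueue<Hit> pq, Hit candidate) {
        for (Hit h : pq) {
            if (h.equals(candidate)) {
                if (candidate.score > h.score) {
                    pq.remove(h);        // remove(Object) is itself a linear search
                    pq.offer(candidate); // reinsert with the higher score
                }
                return;                  // at most one match per external ID
            }
        }
        pq.offer(candidate); // no duplicate present
    }
}
```

The queue would need to be built with a score comparator, e.g. `new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score))`, so the least-scoring hit stays on top.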
Peter, how did you achieve 'last wins', since presumably you must first remove
the old entry from the PQ?
Antony
Peter Keegan wrote:
The duplicate check would just be on the doc ID. I'm using a TreeSet to detect
duplicates, with no noticeable effect on performance. The PQ only has to be
checked for a previous value if and only if the element about to be inserted is
actually inserted, and not dropped because it's less than the least value
already in there. So, the TreeSet is never bigger than the size of the PQ
(typically 25 to a few hundred items), not the size of all hits.
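As a rough sketch of the scheme described above (class and method names are illustrative, not Lucene's): guard a bounded min-heap with a TreeSet of doc IDs, consult the set only when the candidate would actually be inserted, and keep the two structures in sync on eviction:

```java
import java.util.PriorityQueue;
import java.util.TreeSet;

// Illustrative hit holder; Lucene's ScoreDoc plays this role.
class ScoredDoc {
    final int doc;
    final float score;
    ScoredDoc(int doc, float score) { this.doc = doc; this.score = score; }
}

// Bounded min-heap on score plus a TreeSet of doc IDs that never grows
// past the queue size. Duplicates are dropped, so the first doc wins.
class DedupingQueue {
    private final int maxSize;
    private final PriorityQueue<ScoredDoc> pq =
        new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));
    private final TreeSet<Integer> docsInQueue = new TreeSet<>();

    DedupingQueue(int maxSize) { this.maxSize = maxSize; }

    void insert(ScoredDoc d) {
        // Consult the TreeSet only when the candidate would actually land
        // in the queue (queue not full, or it beats the current minimum).
        boolean wouldInsert = pq.size() < maxSize || d.score > pq.peek().score;
        if (!wouldInsert || !docsInQueue.add(d.doc)) {
            return; // dropped outright, or duplicate doc ID (first wins)
        }
        pq.offer(d);
        if (pq.size() > maxSize) {
            docsInQueue.remove(pq.poll().doc); // evict least, keep set in sync
        }
    }

    int size() { return pq.size(); }
}
```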
Peter
On 3/29/07, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
Hm, removing duplicates (as determined by a value of a specified document
field) from the results would be nice.
How would your addition affect performance, considering it has to check
the PQ for a previous value for every candidate hit?
Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ - Tag - Search - Share
----- Original Message ----
From: Peter Keegan <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, March 29, 2007 9:39:13 AM
Subject: FieldSortedHitQueue enhancement
This is a request for an enhancement to FieldSortedHitQueue/PriorityQueue that
would prevent duplicate documents from being inserted, or alternatively,
allow the application to prevent this (reason explained below). I can do
this today by making the 'lessThan' method public and checking the queue
before inserting, like this:
if (hq.size() < maxSize) {
    // doc will be inserted into queue - check for duplicate before inserting
} else if (hq.size() > 0 && !hq.lessThan((ScoreDoc)fieldDoc, (ScoreDoc)hq.top())) {
    // doc will be inserted into queue - check for duplicate before inserting
} else {
    // doc will not be inserted - no check needed
}
However, this is just replicating existing code in PriorityQueue->insert().
An alternative would be to have a method like:
public boolean wouldBeInserted(ScoreDoc doc)
// returns true if doc would be inserted, without inserting
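A minimal sketch of that peek-before-insert check, written here against a generic bounded min-heap rather than Lucene's class (BoundedQueue is illustrative; a real version would live inside PriorityQueue and use its own size, maxSize, top(), and lessThan()):

```java
import java.util.PriorityQueue;

// Illustrative bounded min-heap, not Lucene's PriorityQueue.
class BoundedQueue {
    private final int maxSize;
    private final PriorityQueue<Float> pq = new PriorityQueue<>();

    BoundedQueue(int maxSize) { this.maxSize = maxSize; }

    // Stand-in for the ordering test lessThan() performs on ScoreDocs.
    private boolean lessThan(float a, float b) { return a < b; }

    // The proposed check: true iff insert(score) would actually place
    // the element in the queue rather than drop it, without inserting.
    boolean wouldBeInserted(float score) {
        if (pq.size() < maxSize) return true;                // room left
        return !pq.isEmpty() && !lessThan(score, pq.peek()); // beats the least
    }

    void insert(float score) {
        if (!wouldBeInserted(score)) return;
        pq.offer(score);
        if (pq.size() > maxSize) pq.poll(); // evict the least element
    }
}
```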
The reason for this is that I have some queries that get expanded into
multiple searches and the resulting hits are OR'd together. The queries
contain 'terms' that are not seen by Lucene but are handled by a
HitCollector that uses external data for each document to evaluate hits.
The results from the priority queue should contain no duplicate documents
(first or last doc wins).
Do any of these suggestions seem reasonable? So far, I've been able to use
Lucene without any modifications, and I hope to continue this way.
Peter
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------