I'm familiar with Deduplication; however, I do not wish to remove my
duplicates, so my needs are slightly different. I would like to mark the
first document with signature 'xyz' as unique and each later one as a
duplicate. This way I can filter out "duplicates" during searching using
a filter query but still return the original document.
The only approach I know of at the moment is field collapsing, but I
tried the patch on 1.4.1 and it was terribly slow.
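
To make the marking idea concrete, here is a minimal sketch in plain Java
(no Lucene or Solr classes). It is not the Deduplication component's
implementation. It assumes the signature is an MD5 hash of the whole
document text and that all documents pass through one indexing run, so an
in-memory set of seen signatures is enough. The class and method names
(DuplicateMarker, signature, markDuplicates) are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DuplicateMarker {

    // Hex-encoded MD5 of the text. Assumption: the "signature" is a hash
    // of the full text; a real setup might hash selected fields instead.
    public static String signature(String text) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Returns one flag per document, in input order: false for the first
    // document carrying a given signature, true for every later one.
    public static List<Boolean> markDuplicates(List<String> docs) {
        Set<String> seen = new HashSet<>();
        List<Boolean> flags = new ArrayList<>();
        for (String doc : docs) {
            // Set.add returns false when the signature was already present.
            boolean firstOccurrence = seen.add(signature(doc));
            flags.add(!firstOccurrence);
        }
        return flags;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "This is my test",
                "This is my test",
                "Another test",
                "This is my test");
        System.out.println(markDuplicates(docs)); // [false, true, false, true]
    }
}
```

The flag could then be stored in a field at index time (say a hypothetical
is_duplicate field) and excluded at search time with a filter query such as
fq=-is_duplicate:true. For an index built across several runs, the set of
seen signatures would have to be replaced by a lookup against the index
before each add.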
On 3/5/11 4:43 AM, Grant Ingersoll wrote:
See http://wiki.apache.org/solr/Deduplication. It should be fairly easy to
pull out if you are using just Lucene.
On Mar 5, 2011, at 1:49 AM, Mark wrote:
Is there a way one could detect duplicates (say, by using some unique hash of
certain fields) and mark a document as a duplicate without removing it?
Here is an example:
Doc 1) This is my test
Doc 2) This is my test
Doc 3) Another test
Doc 4) This is my test
Docs 1 and 3 should be considered unique, whereas docs 2 and 4 should be
marked as duplicates (of doc 1).
Can this be easily accomplished?
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem docs using Solr/Lucene:
http://www.lucidimagination.com/search