Is there a way to detect duplicates (say, by using a unique hash of
certain fields) and mark a document as a duplicate without removing
it?
Here is an example:
Doc 1) This is my test
Doc 2) This is my test
Doc 3) Another test
Doc 4) This is my test
Docs 1 and 3 should be considered unique, whereas docs 2 and 4 should
be marked as duplicates (of doc 1).
Can this be easily accomplished?
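
To make the intent concrete, here is a rough sketch in plain Java of the
behaviour I am after. It does not use any Lucene API; the class name
DuplicateMarker, the methods fingerprint and markDuplicates, and the
choice of SHA-256 as the hash are all made up for illustration. Each
document is reduced to a hash of its text, the first document with a
given hash counts as unique, and later ones are only flagged, never
dropped.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
public class DuplicateMarker {

    // Hex-encoded SHA-256 of the field text (assumed stand-in for
    // "some unique hash of certain fields").
    public static String fingerprint(String fieldText) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(fieldText.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // For each document: -1 if it is the first with its fingerprint,
    // otherwise the 0-based index of the earlier document it duplicates.
    public static int[] markDuplicates(List<String> docs) {
        Map<String, Integer> firstSeen = new HashMap<>();
        int[] duplicateOf = new int[docs.size()];
        for (int i = 0; i < docs.size(); i++) {
            // putIfAbsent returns null when the fingerprint is new.
            Integer earlier = firstSeen.putIfAbsent(fingerprint(docs.get(i)), i);
            duplicateOf[i] = (earlier == null) ? -1 : earlier;
        }
        return duplicateOf;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "This is my test", "This is my test",
                "Another test", "This is my test");
        int[] duplicateOf = markDuplicates(docs);
        for (int i = 0; i < docs.size(); i++) {
            String status = (duplicateOf[i] < 0)
                    ? "unique"
                    : "duplicate of doc " + (duplicateOf[i] + 1);
            System.out.println("Doc " + (i + 1) + ": " + status);
        }
    }
}
```

In an index, the flag (and perhaps the fingerprint itself) would
presumably be stored as an extra field on each document, so that all
four documents stay searchable and the duplicates can be filtered out
at query time if desired.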