Re: Detection of index duplicates in Lucene

karl wettin Mon, 30 Jul 2007 05:51:37 -0700

On 30 Jul 2007, at 14:43, Grant Ingersoll wrote:

I believe Nutch has a duplicate detection algorithm. I don't know how easy it would be to run independently on a Lucene index.

There have also been a bunch of near-duplicate ideas that have been presented on the forums before.

This is one of the threads: <http://www.nabble.com/Checking-for-duplicates-inside-index-tf1665494.html>



--
karl

-Grant

On Jul 29, 2007, at 2:18 AM, Dmitry wrote:
We are trying to find out whether there is any implementation for Lucene that detects index duplicates. Assume we have a set of documents, and a document is a bunch of words. After we have created indexes for the same document, we need to know that all indexes will be unique for a specific document (lexical equivalence).
Can we have an implementation that accounts for both situations: one where the algorithm has not detected a duplicate, and another where the algorithm has reported a false duplicate? In that case we could find all duplicate indexes.
We could also use the same algorithm to detect duplicate documents; in that case we save time and get better performance by not running the indexing services against such a document.
Any suggestions would be welcome.
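
To make the exact-duplicate case above concrete, here is a minimal sketch in plain Java. It is not the Nutch algorithm and does not use any Lucene API. It assumes a document is just a string of whitespace-separated words, that "lexical equivalence" means the same words with the same counts regardless of case and order, and that a SHA-256 digest is an acceptable fingerprint. The class name DuplicateFilter and its methods are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper for illustration; not part of Lucene or Nutch. */
class DuplicateFilter {
    // Fingerprints of every document accepted so far.
    private final Set<String> seen = new HashSet<>();

    /** Hex SHA-256 digest of the lower-cased, sorted words of the text. */
    static String fingerprint(String text) {
        String[] words = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        Arrays.sort(words); // ignore word order, keep word counts
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(
                    String.join(" ", words).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    /** Returns true if the text is new (index it), false if it was seen before (skip it). */
    boolean accept(String text) {
        return seen.add(fingerprint(text));
    }
}
```

A caller would check accept(text) before handing the document to the indexer and skip it when the result is false. This kind of exact matching misses near-duplicates (a single changed word gives a different fingerprint), and a false duplicate is only possible through a hash collision, so it covers just the simplest of the situations described above.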

Thanks,

DT,

www.ejinz.com

Search Engine News




---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
