Hi Andrej!

I'm taking a look to fuzzy signatures for near duplicate detection and and I have seen your TextProfileSignature. The question is: If I index the documents with their text signature, is there a way to filter near duplicates at search time without comparing each document with all other?

Thanks
Beto

Andrzej Bialecki wrote:
karl wettin wrote:

17 okt 2006 kl. 17.54 skrev Find Me:

How to eliminate near duplicates from the index?

I would probably try to measure the Ecludian distance between all documents, computed on terms and their positions. Or perhaps use standard deviation to find the distribution of terms in a document. One would based on the output from that try to find a threashold. Either way it will consume lots of CPU.


There are better ways to achieve this. You need to create a fuzzy signature of the document, based on term histogram or shingles - take a look a the Signature framework in Nutch.

There is a substantial literature on this subject - go to Citeseer and run a search for "near duplicate detection".


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to