On 17 Oct 2006 at 18:55, Andrzej Bialecki wrote:
You need to create a fuzzy signature of the document, based on a term histogram or shingles - take a look at the Signature framework in Nutch.

There is substantial literature on this subject - go to CiteSeer and run a search for "near duplicate detection".

Interesting. I'll have to check this out a bit more some day(tm).
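
For anyone picking this thread up later, here is a rough sketch of the shingle-based idea mentioned above. It is not the Nutch Signature framework or its API; the function names (shingles, jaccard, minhash_signature, estimated_similarity) and the parameters k and num_hashes are made up for illustration, and MinHash is just one common way to turn a shingle set into a compact fuzzy signature.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Short documents: treat the whole text as a single shingle.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(seed, shingle):
    """Deterministic 64-bit hash of a shingle, varied by an integer seed."""
    digest = hashlib.md5(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(text, num_hashes=64, k=3):
    """Compact signature: the minimum hash of the shingle set per seed."""
    sh = shingles(text, k)
    if not sh:
        return [0] * num_hashes
    return [min(_hash(seed, s) for s in sh) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching signature slots; approximates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog"
    doc_b = "the quick brown fox jumps over the lazy cat"
    print(jaccard(shingles(doc_a), shingles(doc_b)))
    print(estimated_similarity(minhash_signature(doc_a),
                               minhash_signature(doc_b)))
```

Two documents would then count as near duplicates when the estimated similarity is above some threshold; the right threshold depends on the collection and has to be tuned.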
