On 17 Oct 2006, at 18:55, Andrzej Bialecki wrote:
You need to create a fuzzy signature of the document, based on a term
histogram or shingles - take a look at the Signature framework in
Nutch.
There is a substantial literature on this subject - go to Citeseer
and run a search for "near duplicate detection".
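
For illustration only, here is a rough sketch of the two ideas mentioned above, shingles and a term-histogram signature. It is not the Nutch Signature code; the function names (`shingles`, `jaccard`, `histogram_signature`) and the parameters `k`, `quant` and `min_len` are all made up for this example, and the quantization scheme is just one plausible choice.

```python
import hashlib
import re
from collections import Counter


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Too short for a full window: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def histogram_signature(text, quant=2, min_len=2):
    """Hash a coarse term histogram so that small edits often leave it unchanged.

    Assumption for this sketch: counts are rounded down to a multiple of
    `quant`, and terms whose count falls below `quant` are dropped.
    """
    words = (w for w in re.findall(r"\w+", text.lower()) if len(w) >= min_len)
    counts = Counter(words)
    profile = sorted((t, (c // quant) * quant) for t, c in counts.items() if c >= quant)
    joined = "\n".join(f"{t} {c}" for t, c in profile)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    a = "the quick brown fox jumps over the lazy dog"
    b = "the quick brown fox jumped over the lazy dog"
    print(jaccard(shingles(a), shingles(b)))
    print(histogram_signature("spam spam spam eggs eggs ham"))
```

With shingles, two documents count as near duplicates when their Jaccard similarity is above some threshold you pick. The histogram signature instead gives a single hash, so near duplicates can be found with an exact-match lookup, at the cost of missing pairs whose quantized histograms differ even slightly.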
Interesting. I'll have to check this out a bit more some day(tm).