On 2 Dec 2005, at 10:15, Andrzej Bialecki wrote:
Yes, this is required to detect unmodified content. A small note:
plain MD5Hash(byte[] content) is quite ineffective for many pages,
e.g. pages with a counter, or with ads. It would be good to provide
a framework for other implementations of "page equality" - for now
perhaps we should just say that this value is a byte[], and not
specifically an MD5Hash.
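
To make the byte[] idea concrete, here is a minimal sketch of what
such a pluggable "page equality" could look like. All names in it
(PageSignatureSketch, PageSignature, Md5Signature, unmodified) are
hypothetical and not an existing Nutch API; it only shows that callers
can compare opaque byte[] values without knowing how they were made.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical names throughout; this is not an existing Nutch API.
public class PageSignatureSketch {

    /** Pluggable notion of "page equality": equal signatures mean unmodified content. */
    interface PageSignature {
        byte[] calculate(byte[] content);
    }

    /** Baseline: MD5 over the raw bytes, so any changed byte changes the signature. */
    static class Md5Signature implements PageSignature {
        public byte[] calculate(byte[] content) {
            try {
                return MessageDigest.getInstance("MD5").digest(content);
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalStateException("MD5 not available", e);
            }
        }
    }

    /** Callers compare opaque byte[] values and never depend on the hash type. */
    static boolean unmodified(PageSignature sig, byte[] oldContent, byte[] newContent) {
        return Arrays.equals(sig.calculate(oldContent), sig.calculate(newContent));
    }

    public static void main(String[] args) {
        PageSignature md5 = new Md5Signature();
        byte[] a = "Hello Nutch. Visitors: 41".getBytes(StandardCharsets.UTF_8);
        byte[] b = "Hello Nutch. Visitors: 42".getBytes(StandardCharsets.UTF_8);
        System.out.println(unmodified(md5, a, a)); // true
        System.out.println(unmodified(md5, a, b)); // false: only the counter changed
    }
}
```

The second call shows the weakness mentioned above: a page whose only
change is a visitor counter already counts as modified under plain MD5.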
Some time ago I found an interesting mechanism that might help us:
it is called Locality-Sensitive Hashing (LSH).
From my point of view this would be a perfect solution, and it could
also remove a lot of spam pages. On my todo list I have a task to write
a kind of proof of concept but, like all of us, I was too busy with
other things.
You will find the paper behind the link below. I would really love
to see this in the Nutch sources, and I would offer to work with others
on such a solution.
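
To give a rough idea of the kind of hashing I mean, here is a minimal
sketch using MinHash over word shingles, which is one well-known
locality-sensitive scheme. It is not necessarily the exact method from
the paper, and every name in it (MinHashSketch, shingles, signature,
similarity) as well as the parameters (128 hash functions, 3-word
shingles, the seed) are assumptions made only for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical sketch; class and method names are illustrative only.
public class MinHashSketch {
    private static final long PRIME = 2147483647L; // 2^31 - 1
    private final long[] a; // multipliers of the hash functions h(x) = (a*x + b) mod PRIME
    private final long[] b; // offsets of the hash functions

    MinHashSketch(int numHashes, long seed) {
        Random rnd = new Random(seed);
        a = new long[numHashes];
        b = new long[numHashes];
        for (int i = 0; i < numHashes; i++) {
            a[i] = 1 + rnd.nextInt(Integer.MAX_VALUE - 1); // in [1, PRIME - 1]
            b[i] = rnd.nextInt(Integer.MAX_VALUE);         // in [0, PRIME - 1]
        }
    }

    /** Hash codes of all k-word shingles; empty if the text has fewer than k words. */
    static Set<Integer> shingles(String text, int k) {
        String[] words = text.toLowerCase().trim().split("\\s+");
        Set<Integer> out = new HashSet<>();
        for (int i = 0; i + k <= words.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(words, i, i + k)).hashCode());
        }
        return out;
    }

    /** For each hash function, keep the minimum value seen over all shingles. */
    int[] signature(Set<Integer> shingles) {
        int[] sig = new int[a.length];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (int s : shingles) {
            long x = Math.floorMod((long) s, PRIME);
            for (int i = 0; i < a.length; i++) {
                int h = (int) ((a[i] * x + b[i]) % PRIME);
                if (h < sig[i]) sig[i] = h;
            }
        }
        return sig;
    }

    /** Fraction of matching positions, an estimate of the Jaccard similarity. */
    static double similarity(int[] s1, int[] s2) {
        int same = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] == s2[i]) same++;
        }
        return (double) same / s1.length;
    }

    public static void main(String[] args) {
        MinHashSketch mh = new MinHashSketch(128, 42L);
        String page = "nutch is an open source web search engine built on lucene "
                + "and it crawls and indexes pages from the web visitors 41";
        String samePageLater = page.replace("41", "42");
        int[] s1 = mh.signature(shingles(page, 3));
        int[] s2 = mh.signature(shingles(samePageLater, 3));
        System.out.println("estimated similarity: " + similarity(s1, s2));
    }
}
```

Unlike an MD5 value, such signatures are compared by how many
positions agree, so a page that differs only in a counter or an ad
still scores close to 1, and a threshold decides what counts as
"unmodified".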
http://dbpubs.stanford.edu:8090/pub/2000-23
or the pdf:
http://dbpubs.stanford.edu/pub/showDoc.Fulltext?lang=en&doc=2000-23&format=pdf&compression=&name=2000-23.pdf
Greetings,
Stefan