On 02.12.2005 at 10:15, Andrzej Bialecki wrote:

Yes, this is required to detect unmodified content. A small note: plain MD5Hash(byte[] content) is quite ineffective for many pages, e.g. pages with a counter, or with ads. It would be good to provide a framework for other implementations of "page equality" - for now perhaps we should just say that this value is a byte[], and not specifically an MD5Hash.
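To make the "signature is just a byte[]" idea concrete, here is a minimal sketch. All names in it (PageSignature, RawMd5Signature, NormalizedTextSignature, unmodified) are hypothetical and are not existing Nutch classes. The normalization rule (drop digits, collapse whitespace) is only an illustrative assumption for the "page with a counter" case, not a recommended equality test.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Locale;

public class PageSignatures {

    /** Hypothetical contract: every "page equality" scheme just yields opaque bytes. */
    public interface PageSignature {
        byte[] calculate(byte[] content);
    }

    /** Baseline: MD5 over the raw bytes; any changed byte changes the signature. */
    public static class RawMd5Signature implements PageSignature {
        @Override
        public byte[] calculate(byte[] content) {
            return md5(content);
        }
    }

    /**
     * Illustrative alternative (an assumption, not a proposal from the thread):
     * lowercase the text, drop digit runs and collapse whitespace before hashing,
     * so that a changing hit counter does not alter the signature.
     */
    public static class NormalizedTextSignature implements PageSignature {
        @Override
        public byte[] calculate(byte[] content) {
            String text = new String(content, StandardCharsets.UTF_8);
            String normalized = text.toLowerCase(Locale.ROOT)
                    .replaceAll("[0-9]+", "")
                    .replaceAll("\\s+", " ")
                    .trim();
            return md5(normalized.getBytes(StandardCharsets.UTF_8));
        }
    }

    /** Computes an MD5 digest; MD5 is a required algorithm on every Java platform. */
    static byte[] md5(byte[] data) {
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Two fetches count as unmodified when their signatures are byte-for-byte equal. */
    public static boolean unmodified(PageSignature sig, byte[] oldContent, byte[] newContent) {
        return Arrays.equals(sig.calculate(oldContent), sig.calculate(newContent));
    }
}
```

With this shape the caller only ever compares byte arrays, so the MD5 implementation can be swapped for another one without touching the code that stores or compares signatures.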

Some time ago I found an interesting mechanism that might help us: Locality-Sensitive Hashing (LSH). From my point of view this would also be a perfect solution for removing a lot of spam pages. I have a task on my todo list to write a proof of concept but, like all of us, I have been too busy with other things. You will find the paper behind the link below. I would really love to see this in the Nutch sources, and I offer to work with others on such a solution.

http://dbpubs.stanford.edu:8090/pub/2000-23
or the pdf:
http://dbpubs.stanford.edu/pub/showDoc.Fulltext?lang=en&doc=2000-23&format=pdf&compression=&name=2000-23.pdf
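
To give a feeling for what such a proof of concept could look like, here is a rough sketch using MinHash over word shingles, which is one well-known locality-sensitive scheme for set similarity. It is not taken from the paper above and may differ from the method described there. The class name MinHashSketch, the shingle size and the number of hash functions are all assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Random;
import java.util.Set;

public class MinHashSketch {
    private final int[] seeds;

    /** numHashes controls the precision of the estimate; seed makes runs repeatable. */
    public MinHashSketch(int numHashes, long seed) {
        Random rnd = new Random(seed);
        seeds = new int[numHashes];
        for (int i = 0; i < numHashes; i++) {
            seeds[i] = rnd.nextInt();
        }
    }

    /** Splits text into overlapping runs of k words ("shingles"). */
    public static Set<String> shingles(String text, int k) {
        String[] words = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        Set<String> out = new HashSet<>();
        for (int i = 0; i + k <= words.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(words, i, i + k)));
        }
        return out;
    }

    /** 32-bit finalizer from MurmurHash3; a bijection on ints, used to scramble hash codes. */
    private static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    /** For each seeded hash function, keeps the minimum hash value over all shingles. */
    public int[] signature(Set<String> shingles) {
        int[] sig = new int[seeds.length];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (String s : shingles) {
            int h = s.hashCode();
            for (int i = 0; i < seeds.length; i++) {
                int v = mix(h ^ seeds[i]);
                if (v < sig[i]) {
                    sig[i] = v;
                }
            }
        }
        return sig;
    }

    /** Fraction of matching positions; an estimate of the Jaccard similarity of the shingle sets. */
    public static double similarity(int[] x, int[] y) {
        int same = 0;
        for (int i = 0; i < x.length; i++) {
            if (x[i] == y[i]) {
                same++;
            }
        }
        return (double) same / x.length;
    }
}
```

Two pages that differ only in a counter or an ad block share most of their shingles, so their signatures agree in most positions, while unrelated pages agree in almost none. A threshold on similarity(...) would then stand in for the exact equality test that plain MD5 gives us today.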

Greetings,
Stefan
