Hi folks, I have a lot of questions about QBE in Nutch! First, does Nutch support QBE through the (implemented?) More Like This function? If so, can anyone briefly explain the algorithm behind it, i.e. how the similarity between web pages is computed?
The one used by Google is described in this paper: http://citeseer.ist.psu.edu/dean99finding.html. However, it only shows 31 similar pages (why 31? I still don’t have an authoritative explanation for that number; probably it is for the sake of a concise, relevant answer instead of ranking thousands of query results), and only for well-known sites that are supposed to have non-obscure content (sites like nytimes.com, cnn.com, google.com) rather than personal web pages or other less popular web pages. Is there any benchmark for testing state-of-the-art web page similarity functions?

Best,
Nizar
