EM wrote:
202443 Pages consumed: 130000 (at index 130000). Links fetched: 233386.
202443 Suspicious outlink count = 30442 for [http://www.dmoz.org/].
202444 Pages consumed: 135000 (at index 135000). Links fetched: 272315.

If maxoutlinks is already specified in the XML config, why does Nutch bother counting anything over that limit?

During PageRank computation, Nutch retrieves all links from a given page by MD5. If many pages share the same MD5, it can retrieve the outlinks from all of them. I have seen some "bot traps" with big site structures in which every page had exactly the same MD5 (once I had over a million identical pages in my index, with different URLs from the same host). So in this case we get the union of all such outlinks. In some situations a large number of outlinks is not a problem (as in your case: all pages injected from dmoz are outlinks from dmoz), but usually it indicates a problem in your index, or at least a reason to look at it. So I decided to print a warning in this case so that one can have a look at such a site.
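
To illustrate the effect described above, here is a minimal sketch. It is not the actual Nutch code: the class OutlinkUnionSketch, the Page type, the method names, and the threshold value are all hypothetical and exist only for this example. It groups pages by content MD5, takes the union of their outlinks, and reports any MD5 whose combined outlink count exceeds the threshold.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: none of these names come from Nutch.
public class OutlinkUnionSketch {

    // A fetched page: its URL, the MD5 of its content, and its outlinks.
    static final class Page {
        final String url;
        final String md5;
        final List<String> outlinks;

        Page(String url, String md5, List<String> outlinks) {
            this.url = url;
            this.md5 = md5;
            this.outlinks = outlinks;
        }
    }

    // Looking links up by MD5 means pages with identical content are merged:
    // the result for one MD5 is the union of the outlinks of all such pages.
    static Map<String, Set<String>> unionOutlinksByMd5(List<Page> pages) {
        Map<String, Set<String>> union = new LinkedHashMap<>();
        for (Page p : pages) {
            union.computeIfAbsent(p.md5, k -> new LinkedHashSet<>()).addAll(p.outlinks);
        }
        return union;
    }

    // Returns the MD5s whose merged outlink count is above the (assumed) threshold.
    static List<String> suspiciousMd5s(Map<String, Set<String>> union, int threshold) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : union.entrySet()) {
            if (e.getValue().size() > threshold) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Page> pages = Arrays.asList(
            // Two different URLs with identical content, as in a bot trap.
            new Page("http://trap.example/a", "md5-trap",
                     Arrays.asList("http://trap.example/1", "http://trap.example/2")),
            new Page("http://trap.example/b", "md5-trap",
                     Arrays.asList("http://trap.example/2", "http://trap.example/3")),
            new Page("http://ok.example/", "md5-ok",
                     Arrays.asList("http://ok.example/about")));

        Map<String, Set<String>> union = unionOutlinksByMd5(pages);
        for (String md5 : suspiciousMd5s(union, 2)) {
            System.out.println("Suspicious outlink count = "
                + union.get(md5).size() + " for MD5 " + md5);
        }
    }
}
```

Each page here stays under a per-page limit of two outlinks, yet the merged set for the shared MD5 has three, which is why a count above the configured limit can still show up at this stage.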
Regards
Piotr
