I've been seeing this with various nutch nightlies, the latest being 6/10. I have a page that's 54K with 200 hrefs; wc counts about 3k words. The main body contains 74 hrefs, all relative URLs. The fetcher never even attempts to fetch certain of those URLs. And that's just one specific example; there are random links all throughout the site that are simply not fetched.
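One suspect I don't know how to rule out is URL filtering. As I understand it, the crawl tool applies the rules in crawl-urlfilter.txt; the stock rules look roughly like this (sketched from memory, so the exact defaults in the nightlies may differ):

    # skip URLs containing certain characters as probable queries, etc.
    -[?*!@=]

    # accept hosts in MY.DOMAIN.NAME
    +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

    # skip everything else
    -.

As far as I can tell, though, the missing links are ordinary relative hrefs that should pass these rules once resolved against the page URL.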
I start with a clean slate each time (rm -rf dir, no indexes, etc.) and run:

    nutch crawl urls -dir dir -depth 12

In both crawl-tool.xml and nutch-site.xml I set:

    http.content.limit = 2000000   (so the file is not truncated)
    indexer.max.tokens = 6000      (so the index is not truncated, although the page is never fetched anyway)

I set them in both files because some settings seem to be ignored. For instance, I can't seem to turn on verbose output for either the fetcher or http:

    http.verbose = true
    fetcher.verbose = true
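In case the declaration syntax matters, each property goes inside the config file's root element in the usual name/value form (a minimal sketch; the values are the ones above):

    <property>
      <name>http.content.limit</name>
      <!-- raised from the default so large pages are not truncated -->
      <value>2000000</value>
    </property>
    <property>
      <name>fetcher.verbose</name>
      <!-- this one appears to have no effect for me -->
      <value>true</value>
    </property>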
Any ideas what could be going on? How can I begin to debug this?

Thanks,

--
Robert Dale