This happens with various Nutch nightlies, the latest being 6/10...

I have a page that's 54k with 200 hrefs; wc counts 3k words.  In the
main body there are 74 hrefs, all relative URLs.  The fetcher never
attempts to fetch certain of those URLs.  That's just one specific
example; there are links scattered all through the site that are
simply never fetched.

I start with a clean slate each time - rm -rf dir, no indexes, etc.
I run with "nutch crawl urls -dir dir -depth 12".

In both crawl-tool.xml and nutch-site.xml:

http.content.limit = 2000000  (so the file is not truncated)
indexer.max.tokens = 6000  (so the index is not truncated, though the
page is never fetched anyway)
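
For reference, this is the shape of the entries (in the usual
<property> wrapper inside <configuration>; crawl-tool.xml has the
same form):

    <property>
      <name>http.content.limit</name>
      <value>2000000</value>
    </property>
    <property>
      <name>indexer.max.tokens</name>
      <value>6000</value>
    </property>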

I set them in both files because some settings seem to be ignored.
For instance, I can't seem to turn on verbose logging for either the
fetcher or http:
http.verbose = true
fetcher.verbose = true
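
I'm starting to wonder whether the verbose flags only emit at debug
level, so nothing shows up unless the logger is turned up too.  If
this nightly uses log4j, I'd guess something like this in
conf/log4j.properties would be needed as well (the class names are my
guess):

    log4j.logger.org.apache.nutch.fetcher.Fetcher=DEBUG
    log4j.logger.org.apache.nutch.protocol.http=DEBUG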

Any ideas for what could be going on?  How can I begin to debug this?
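For instance, is there a way to dump the db and check whether the
missing URLs were ever recorded at all?  I'm guessing something like
the following, assuming an 0.8-style layout where the crawl db ends
up under dir/crawldb (paths and flags are my guess):

    bin/nutch readdb dir/crawldb -stats
    bin/nutch readdb dir/crawldb -dump dumpdir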

Thanks,

-- 
Robert Dale
