Michael Plax wrote:
Question summary:
Q: How can I set up the crawler to index an entire web site?

I'm trying to run a crawl with the command from the tutorial.

1. In the urls file I have the start page (index.html).
2. In the configuration file conf/crawl-urlfilter.txt, the domain was changed.
3. I run: $ bin/nutch crawl urls -dir crawledtottaly -depth 10 >& crawl.log
4. Crawling is finished
5. I run: bin/nutch readdb crawled/db -stats
   output:
  $ bin/nutch readdb crawledtottaly/db -stats
  run java in C:\Sun\AppServer\jdk
  060118 155526 parsing file:/C:/nutch/conf/nutch-default.xml
  060118 155526 parsing file:/C:/nutch/conf/nutch-site.xml
  060118 155526 No FS indicated, using default:local
  Stats for [EMAIL PROTECTED]
  -------------------------------
  Number of pages: 63
  Number of links: 3906
6. I get fewer pages than I expected.

This is a common question, but there's no single common answer. The problem could be that URLs are blocked by your URL filter, or by http.max.delays, or something else.
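For example, the tutorial's conf/crawl-urlfilter.txt restricts the crawl to one domain with a pattern along these lines (MY.DOMAIN.NAME is the placeholder you replace with your own domain):

  # accept urls within the target domain
  +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
  # skip everything else
  -.

If the edited pattern doesn't match every URL you expect to crawl, those pages are dropped silently, and the stock file also skips URLs containing characters such as '?', which excludes many dynamic pages. If pages are instead failing because busy-host retries run out, you can raise http.max.delays by adding a property like the following inside the existing root element of conf/nutch-site.xml:

  <property>
    <name>http.max.delays</name>
    <value>100</value>
  </property>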

What might help is if the fetcher and crawl db printed more detailed statistics. In particular, the fetcher could categorize failures and periodically print a list of failure counts by category. The crawl db updater could also report the number of URLs that are filtered out.
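To make the categorization idea concrete, here is a rough sketch (not existing Nutch code) of the kind of per-category tally the fetcher threads could update and dump every so often; the category strings are just examples:

  // Sketch only, not part of Nutch: each fetcher thread calls record() when a
  // fetch fails, and the fetcher logs summary() every N pages or at the end.
  import java.util.HashMap;
  import java.util.Iterator;
  import java.util.Map;

  public class FetchFailureStats {
    private final Map counts = new HashMap();            // category -> Integer

    public synchronized void record(String category) {   // e.g. "robots denied", "exceeded http.max.delays"
      Integer n = (Integer) counts.get(category);
      counts.put(category, new Integer(n == null ? 1 : n.intValue() + 1));
    }

    public synchronized String summary() {
      StringBuffer sb = new StringBuffer("fetch failures by category:\n");
      for (Iterator i = counts.entrySet().iterator(); i.hasNext();) {
        Map.Entry e = (Map.Entry) i.next();
        sb.append("  ").append(e.getKey()).append(": ").append(e.getValue()).append("\n");
      }
      return sb.toString();
    }
  }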

In the meantime, please examine the logs, particularly watching for errors while fetching.
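If you captured the output with >& crawl.log as in step 3, a quick way to get an overview is to pull out the error lines; the exact messages differ between Nutch versions, so adjust the pattern to what your log actually contains:

  $ grep -icE "failed|error" crawl.log           # rough count of failure lines
  $ grep -iE "failed|error" crawl.log | less     # inspect the individual failures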

Doug
