Michael Plax wrote:
Question summary:
Q: How can I set up the crawler to index an entire web site?
I'm trying to run a crawl with the command from the tutorial:
1. In the urls file I have the start page (index.html).
2. In the configuration file conf/crawl-urlfilter.txt the domain was changed (see the filter sketch after this list).
3. I run: $ bin/nutch crawl urls -dir crawledtottaly -depth 10 >& crawl.log
4. The crawl finishes.
5. I run: bin/nutch readdb crawledtottaly/db -stats
output:
$ bin/nutch readdb crawledtottaly/db -stats
run java in C:\Sun\AppServer\jdk
060118 155526 parsing file:/C:/nutch/conf/nutch-default.xml
060118 155526 parsing file:/C:/nutch/conf/nutch-site.xml
060118 155526 No FS indicated, using default:local
Stats for [EMAIL PROTECTED]
-------------------------------
Number of pages: 63
Number of links: 3906
6. I get fewer pages than I expected.
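
For reference, the edit in step 2 presumably follows the pattern from the Nutch tutorial: each line of conf/crawl-urlfilter.txt is a '+' (accept) or '-' (reject) followed by a regular expression, applied roughly top to bottom. A sketch, where MY.DOMAIN.NAME is the tutorial's placeholder rather than the poster's actual value:

  # skip file:, ftp:, and mailto: urls
  -^(file|ftp|mailto):
  # accept hosts in MY.DOMAIN.NAME
  +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
  # skip everything else
  -.

If the accept pattern does not match every host that the site's links actually use, those URLs are silently dropped, which by itself can explain a low page count.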
This is a common question, but there's no single answer. The problem could be that URLs are blocked by your URL filter, by http.max.delays, or by something else.
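
If http.max.delays is the cause (the fetcher waits for a busy host at most that many times, then gives up on the page), the value can be raised by overriding it in conf/nutch-site.xml. A minimal sketch, with 100 as an illustrative value only:

  <property>
    <name>http.max.delays</name>
    <value>100</value>
  </property>

The property element goes inside the root element of nutch-site.xml, alongside any other overrides of conf/nutch-default.xml.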
What might help is if the fetcher and crawl db printed more detailed
statistics. In particular, the fetcher could categorize failures and
periodically print a list of failure counts by category. The crawl db
updater could also list the number of URLs that are filtered.
In the meantime, please examine the logs, particularly watching for
errors while fetching.
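
For example, a quick way to surface fetch problems in the crawl.log produced by step 3 (the exact log wording varies between Nutch versions, so the pattern below is only a starting point):

  $ grep -ciE "fail|error|exception" crawl.log          # count suspicious lines
  $ grep -iE "fail|error|exception" crawl.log | less    # inspect them

Note that URLs rejected by the URL filter never reach the fetcher, so they will not show up here; the filter patterns need to be checked separately.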
Doug