Michael Plax wrote:
Question summary:
Q: How can I set up the crawler to index an entire web site?

I'm trying to run a crawl with the command from the tutorial:

1. In the urls file I have the start page (index.html).
2. In the configuration file conf/crawl-urlfilter.txt the domain was changed.
3. I run: $ bin/nutch crawl urls -dir crawledtottaly -depth 10 >& crawl.log
4. Crawling is finished
5. I run: bin/nutch readdb crawledtottaly/db -stats
   output:
  $ bin/nutch readdb crawledtottaly/db -stats
  run java in C:\Sun\AppServer\jdk
  060118 155526 parsing file:/C:/nutch/conf/nutch-default.xml
  060118 155526 parsing file:/C:/nutch/conf/nutch-site.xml
  060118 155526 No FS indicated, using default:local
  Stats for [EMAIL PROTECTED]
  -------------------------------
  Number of pages: 63
  Number of links: 3906
6. I get fewer pages than I expected.

This is a common question, but there's no single answer. The problem could be that URLs are blocked by your URL filter, or that the fetcher is giving up because of http.max.delays, or something else.
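One quick thing to check is the filter: the crawl only follows URLs that conf/crawl-urlfilter.txt accepts, so a single-domain setup needs an accept rule roughly like the following (mydomain.com is just a placeholder for your own domain; URLs no rule accepts are dropped):

  +^http://([a-z0-9]*\.)*mydomain.com/

If instead the fetcher is giving up on a slow or busy server, you can try raising http.max.delays in conf/nutch-site.xml. Something like this, where the value is only an illustration (check nutch-default.xml for the shipped default):

  <property>
    <name>http.max.delays</name>
    <value>100</value>
    <description>Number of times a fetcher thread will wait for a host
    before giving up on a URL.</description>
  </property>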

It would help if the fetcher and crawl db printed more detailed statistics. In particular, the fetcher could categorize failures and periodically print failure counts by category. The crawl db updater could likewise report the number of URLs that are filtered out.

In the meantime, please examine the logs, particularly watching for errors while fetching.
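A rough scan along these lines usually turns up fetch problems; the exact log wording differs between Nutch versions, so treat the patterns as a starting point:

  $ grep -i "failed" crawl.log
  $ grep -i "exception" crawl.log
  $ grep -ic "fetching" crawl.log    # count of fetch attempts, case-insensitive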

Doug

