Michael Plax wrote:
Question summary:
Q: How can I set up the crawler to index an entire web site?

I'm trying to run a crawl with the command from the tutorial:

1. In the urls file I have the start page (index.html).
2. In the configuration file conf/crawl-urlfilter.txt the domain was changed.
3. I run: $ bin/nutch crawl urls -dir crawledtottaly -depth 10 >& crawl.log
4. Crawling is finished
5. I run: bin/nutch readdb crawledtottaly/db -stats
   output:
  $ bin/nutch readdb crawledtottaly/db -stats
  run java in C:\Sun\AppServer\jdk
  060118 155526 parsing file:/C:/nutch/conf/nutch-default.xml
  060118 155526 parsing file:/C:/nutch/conf/nutch-site.xml
  060118 155526 No FS indicated, using default:local
  Stats for [EMAIL PROTECTED]
  -------------------------------
  Number of pages: 63
  Number of links: 3906
6. I get fewer pages than I expected.

This is a common question, but there's no single answer. The problem could be that URLs are blocked by your URL filter, or that the fetcher is giving up because of http.max.delays, or something else.
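One quick thing to check is the filter: the crawl only follows URLs that conf/crawl-urlfilter.txt accepts, so a single-domain setup needs an accept rule roughly like the following (mydomain.com is just a placeholder for your own domain; URLs no rule accepts are dropped):

  +^http://([a-z0-9]*\.)*mydomain.com/

If instead the fetcher is giving up on a slow or busy server, you can try raising http.max.delays in conf/nutch-site.xml. Something like this, where the value is only an illustration (check nutch-default.xml for the shipped default):

  <property>
    <name>http.max.delays</name>
    <value>100</value>
    <description>Number of times a fetcher thread will wait for a host
    before giving up on a URL.</description>
  </property>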

It would help if the fetcher and crawl db printed more detailed statistics. In particular, the fetcher could categorize failures and periodically print failure counts by category. The crawl db updater could likewise report the number of URLs that are filtered out.

In the meantime, please examine the logs, particularly watching for errors while fetching.
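A rough scan along these lines usually turns up fetch problems; the exact log wording differs between Nutch versions, so treat the patterns as a starting point:

  $ grep -i "failed" crawl.log
  $ grep -i "exception" crawl.log
  $ grep -ic "fetching" crawl.log    # count of fetch attempts, case-insensitive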

Doug

