Michael Plax wrote:
Question summary:
Q: How can I set up the crawler to index an entire web site?
I'm trying to run a crawl with the command from the tutorial:
1. In the urls file I have the start page (index.html).
2. In the configuration file conf/crawl-urlfilter.txt the domain was changed (see the filter sketch after this list).
3. I run: $ bin/nutch crawl urls -dir crawledtottaly -depth 10 >& crawl.log
4. The crawl finishes.
5. I run: bin/nutch readdb crawledtottaly/db -stats
output:
$ bin/nutch readdb crawledtottaly/db -stats
run java in C:\Sun\AppServer\jdk
060118 155526 parsing file:/C:/nutch/conf/nutch-default.xml
060118 155526 parsing file:/C:/nutch/conf/nutch-site.xml
060118 155526 No FS indicated, using default:local
Stats for [EMAIL PROTECTED]
-------------------------------
Number of pages: 63
Number of links: 3906
6. I get fewer pages than I expected.
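
For reference, the edit in step 2 presumably follows the pattern from the Nutch tutorial: each line of conf/crawl-urlfilter.txt is a '+' (accept) or '-' (reject) followed by a regular expression, applied roughly top to bottom. A sketch, where MY.DOMAIN.NAME is the tutorial's placeholder rather than the poster's actual value:

  # skip file:, ftp:, and mailto: urls
  -^(file|ftp|mailto):
  # accept hosts in MY.DOMAIN.NAME
  +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
  # skip everything else
  -.

If the accept pattern does not match every host that the site's links actually use, those URLs are silently dropped, which by itself can explain a low page count.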
This is a common question, but there's no single answer. The problem could be that URLs are blocked by your URL filter, by http.max.delays, or by something else.
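
If http.max.delays is the cause (the fetcher waits for a busy host at most that many times, then gives up on the page), the value can be raised by overriding it in conf/nutch-site.xml. A minimal sketch, with 100 as an illustrative value only:

  <property>
    <name>http.max.delays</name>
    <value>100</value>
  </property>

The property element goes inside the root element of nutch-site.xml, alongside any other overrides of conf/nutch-default.xml.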
What might help is if the fetcher and crawl db printed more detailed
statistics. In particular, the fetcher could categorize failures and
periodically print a list of failure counts by category. The crawl db
updater could also list the number of URLs that are filtered.
In the meantime, please examine the logs, particularly watching for
errors while fetching.
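
For example, a quick way to surface fetch problems in the crawl.log produced by step 3 (the exact log wording varies between Nutch versions, so the pattern below is only a starting point):

  $ grep -ciE "fail|error|exception" crawl.log          # count suspicious lines
  $ grep -iE "fail|error|exception" crawl.log | less    # inspect them

Note that URLs rejected by the URL filter never reach the fetcher, so they will not show up here; the filter patterns need to be checked separately.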
Doug