Hello, I just started using Nutch. I followed the tutorial and everything worked fine with *Intranet Crawling* for the site http://www.apache.org.
The problem is that when I try any other site (e.g. http://www.bbc.com), it gets zero pages. Below is the log:

run java in D:\j2sdk1.4.2_04
060123 081546 parsing file:/D:/nutch-0.7.1/conf/nutch-default.xml
060123 081547 parsing file:/D:/nutch-0.7.1/conf/crawl-tool.xml
060123 081547 parsing file:/D:/nutch-0.7.1/conf/nutch-site.xml
060123 081547 No FS indicated, using default:local
060123 081547 crawl started in: d:/crawled
060123 081547 rootUrlFile = urls
060123 081547 threads = 10
060123 081547 depth = 3
060123 081548 Created webdb at LocalFS,D:\crawled\db
060123 081548 Starting URL processing
060123 081548 Plugins: looking in: D:\nutch-0.7.1\plugins
060123 081548 parsing: D:\nutch-0.7.1\plugins\clustering-carrot2\plugin.xml
060123 081549 impl: point=org.apache.nutch.clustering.OnlineClusterer class= org.apache.nutch.clustering.carrot2.Clusterer
060123 081549 not including: D:\nutch-0.7.1\plugins\creativecommons
060123 081549 parsing: D:\nutch-0.7.1\plugins\index-basic\plugin.xml
060123 081549 impl: point=org.apache.nutch.indexer.IndexingFilter class= org.apache.nutch.indexer.basic.BasicIndexingFilter
060123 081549 parsing: D:\nutch-0.7.1\plugins\index-more\plugin.xml
060123 081549 impl: point=org.apache.nutch.indexer.IndexingFilter class= org.apache.nutch.indexer.more.MoreIndexingFilter
060123 081549 not including: D:\nutch-0.7.1\plugins\language-identifier
060123 081549 parsing: D:\nutch-0.7.1\plugins\nutch-extensionpoints\plugin.xml
060123 081549 not including: D:\nutch-0.7.1\plugins\ontology
060123 081549 not including: D:\nutch-0.7.1\plugins\parse-ext
060123 081549 parsing: D:\nutch-0.7.1\plugins\parse-html\plugin.xml
060123 081549 impl: point=org.apache.nutch.parse.Parser class= org.apache.nutch.parse.html.HtmlParser
060123 081549 not including: D:\nutch-0.7.1\plugins\parse-js
060123 081549 not including: D:\nutch-0.7.1\plugins\parse-msword
060123 081549 not including: D:\nutch-0.7.1\plugins\parse-pdf
060123 081549 not including: D:\nutch-0.7.1\plugins\parse-rss
060123 081549 parsing: D:\nutch-0.7.1\plugins\parse-text\plugin.xml
060123 081549 impl: point=org.apache.nutch.parse.Parser class= org.apache.nutch.parse.text.TextParser
060123 081549 not including: D:\nutch-0.7.1\plugins\protocol-file
060123 081549 not including: D:\nutch-0.7.1\plugins\protocol-ftp
060123 081549 parsing: D:\nutch-0.7.1\plugins\protocol-http\plugin.xml
060123 081549 impl: point=org.apache.nutch.protocol.Protocol class= org.apache.nutch.protocol.http.Http
060123 081549 parsing: D:\nutch-0.7.1\plugins\protocol-httpclient\plugin.xml
060123 081549 impl: point=org.apache.nutch.protocol.Protocol class= org.apache.nutch.protocol.httpclient.Http
060123 081549 impl: point=org.apache.nutch.protocol.Protocol class= org.apache.nutch.protocol.httpclient.Http
060123 081549 parsing: D:\nutch-0.7.1\plugins\query-basic\plugin.xml
060123 081549 impl: point=org.apache.nutch.searcher.QueryFilter class= org.apache.nutch.searcher.basic.BasicQueryFilter
060123 081549 parsing: D:\nutch-0.7.1\plugins\query-more\plugin.xml
060123 081549 impl: point=org.apache.nutch.searcher.QueryFilter class= org.apache.nutch.searcher.more.TypeQueryFilter
060123 081549 impl: point=org.apache.nutch.searcher.QueryFilter class= org.apache.nutch.searcher.more.DateQueryFilter
060123 081549 parsing: D:\nutch-0.7.1\plugins\query-site\plugin.xml
060123 081549 impl: point=org.apache.nutch.searcher.QueryFilter class= org.apache.nutch.searcher.site.SiteQueryFilter
060123 081549 parsing: D:\nutch-0.7.1\plugins\query-url\plugin.xml
060123 081549 impl: point=org.apache.nutch.searcher.QueryFilter class= org.apache.nutch.searcher.url.URLQueryFilter
060123 081549 not including: D:\nutch-0.7.1\plugins\urlfilter-prefix
060123 081549 parsing: D:\nutch-0.7.1\plugins\urlfilter-regex\plugin.xml
060123 081549 impl: point=org.apache.nutch.net.URLFilter class= org.apache.nutch.net.RegexURLFilter
060123 081549 found resource crawl-urlfilter.txt at file:/D:/nutch-0.7.1/conf/crawl-urlfilter.txt
.060123 081549 Added 0 pages
060123 081549 FetchListTool started
060123 081550 Overall processing: Sorted 0 entries in 0.0 seconds.
060123 081550 Overall processing: Sorted NaN entries/second
060123 081550 FetchListTool completed
060123 081550 logging at FINE
060123 081550 logging at INFO
060123 081551 Updating D:\crawled\db
060123 081551 Updating for D:\crawled\segments\20060123081549
060123 081551 Finishing update
060123 081551 Update finished
060123 081551 FetchListTool started
060123 081552 Overall processing: Sorted 0 entries in 0.0 seconds.
060123 081552 Overall processing: Sorted NaN entries/second
060123 081552 FetchListTool completed
060123 081552 logging at INFO
060123 081553 Updating D:\crawled\db
060123 081553 Updating for D:\crawled\segments\20060123081551
060123 081553 Finishing update
060123 081553 Update finished
060123 081553 FetchListTool started
060123 081554 Overall processing: Sorted 0 entries in 0.0 seconds.
060123 081554 Overall processing: Sorted NaN entries/second
060123 081554 FetchListTool completed
060123 081554 logging at INFO
060123 081555 Updating D:\crawled\db
060123 081555 Updating for D:\crawled\segments\20060123081553
060123 081555 Finishing update
060123 081555 Update finished
060123 081555 Updating D:\crawled\segments from D:\crawled\db
060123 081555 reading D:\crawled\segments\20060123081549
060123 081555 reading D:\crawled\segments\20060123081551
060123 081555 reading D:\crawled\segments\20060123081553
060123 081555 Sorting pages by url...
060123 081555 Getting updated scores and anchors from db...
060123 081555 Sorting updates by segment...
060123 081555 Updating segments...
060123 081555 Done updating D:\crawled\segments from D:\crawled\db
060123 081555 indexing segment: D:\crawled\segments\20060123081549
060123 081555 * Opening segment 20060123081549
060123 081556 * Indexing segment 20060123081549
060123 081556 * Optimizing index...
060123 081556 * Moving index to NFS if needed...
060123 081556 DONE indexing segment 20060123081549: total 0 records in 0.05s (NaN rec/s).
060123 081556 done indexing
060123 081556 indexing segment: D:\crawled\segments\20060123081551
060123 081556 * Opening segment 20060123081551
060123 081556 * Indexing segment 20060123081551
060123 081556 * Optimizing index...
060123 081556 * Moving index to NFS if needed...
060123 081556 DONE indexing segment 20060123081551: total 0 records in 0.03s (NaN rec/s).
060123 081556 done indexing
060123 081556 indexing segment: D:\crawled\segments\20060123081553
060123 081556 * Opening segment 20060123081553
060123 081556 * Indexing segment 20060123081553
060123 081556 * Optimizing index...
060123 081556 * Moving index to NFS if needed...
060123 081556 DONE indexing segment 20060123081553: total 0 records in 0.04s (NaN rec/s).
060123 081556 done indexing
060123 081556 Reading url hashes...
060123 081556 Sorting url hashes...
060123 081556 Deleting url duplicates...
060123 081556 Deleted 0 url duplicates.
060123 081556 Reading content hashes...
060123 081556 Sorting content hashes...
060123 081556 Deleting content duplicates...
060123 081556 Deleted 0 content duplicates.
060123 081556 Duplicate deletion complete locally. Now returning to NFS...
060123 081556 DeleteDuplicates complete
060123 081556 Merging segment indexes...
060123 081556 crawl finished: d:/crawled