I am trying to test the integrity of the crawl before moving on to Tomcat. I am using Nutch 1.0 on Ubuntu 9.04. Could someone tell me why I am getting "Total hits: 0"?
Regards,
Sanjeev

Here are the config files that I have configured.

conf/crawl-urlfilter.txt:

+^http://([a-z0-9]*\.)*apache.org/

conf/nutch-site.xml:

<name>http.agent.name</name>
<value>buzz</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
</description>

Crawling the website:

sanj...@tulip:~/nutch/nutch-1.0$ bin/nutch crawl urls -dir buzz -depth 1 -topN 1
crawl started in: buzz
rootUrlDir = urls
threads = 10
depth = 1
topN = 1
Injector: starting
Injector: crawlDb: buzz/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: buzz/segments/20090923194930
Generator: filtering: true
Generator: topN: 1
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting
Fetcher: segment: buzz/segments/20090923194930
Fetcher: threads: 10
QueueFeeder finished: total 1 records.
-finishing thread FetcherThread, activeThreads=1
fetching http://wiki.apache.org/tomcat/WordIndex
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
fetch of http://wiki.apache.org/tomcat/WordIndex failed with: java.net.SocketTimeoutException: Read timed out
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: buzz/crawldb
CrawlDb update: segments: [buzz/segments/20090923194930]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: buzz/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/home/sanjeev/nutch/nutch-1.0/buzz/segments/20090922223037
LinkDb: adding segment: file:/home/sanjeev/nutch/nutch-1.0/buzz/segments/20090923194930
LinkDb: adding segment: file:/home/sanjeev/nutch/nutch-1.0/buzz/segments/20090922213851
LinkDb: adding segment: file:/home/sanjeev/nutch/nutch-1.0/buzz/segments/20090922212629
LinkDb: adding segment: file:/home/sanjeev/nutch/nutch-1.0/buzz/segments/20090923152048
LinkDb: adding segment: file:/home/sanjeev/nutch/nutch-1.0/buzz/segments/20090922212647
LinkDb: adding segment: file:/home/sanjeev/nutch/nutch-1.0/buzz/segments/20090923191524
LinkDb: adding segment: file:/home/sanjeev/nutch/nutch-1.0/buzz/segments/20090923141246
LinkDb: adding segment: file:/home/sanjeev/nutch/nutch-1.0/buzz/segments/20090922213008
LinkDb: merging with existing linkdb: buzz/linkdb
LinkDb: done
Deleting old indexes: buzz/indexes
Deleting old merged index: buzz/index
Indexer: starting
Indexer: done
Dedup: starting
Dedup: adding indexes in: buzz/indexes
Dedup: done
merging indexes to: buzz/index
Adding file:/home/sanjeev/nutch/nutch-1.0/buzz/indexes/part-00000
done merging
crawl finished: buzz

Testing the Crawl:

sanj...@tulip:~/nutch/nutch-1.0$ bin/nutch org.apache.nutch.searcher.NutchBean apache
Total hits: 0
sanj...@tulip:~/nutch/nutch-1.0$
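
In case it is relevant for diagnosing this: the only URL fetched in this run timed out, so I am not sure anything ever made it into the buzz index. Before going to Tomcat I was going to sanity-check the crawldb and segments with readdb/readseg, roughly like this (I am assuming these invocations are right for 1.0):

# overall crawldb statistics - how many URLs are fetched vs. unfetched
bin/nutch readdb buzz/crawldb -stats

# status of the one URL this run tried to fetch
bin/nutch readdb buzz/crawldb -url http://wiki.apache.org/tomcat/WordIndex

# list the segments and how many entries each one contains
bin/nutch readseg -list -dir buzz/segments

If the stats show nothing with a fetched status, I assume an empty result from NutchBean is expected and the real problem is the read timeout.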
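
The other thing I am unsure about is whether NutchBean is even looking at the buzz directory. As far as I can tell the searcher reads the searcher.dir property, which defaults to a directory named "crawl", so I am wondering whether I also need something like this in conf/nutch-site.xml (the <property> wrapper is the usual nutch-site.xml layout; the path is just my crawl output directory):

<property>
  <name>searcher.dir</name>
  <!-- point the searcher at the crawl output directory used above -->
  <value>/home/sanjeev/nutch/nutch-1.0/buzz</value>
</property>

Is that the right way to point the searcher at buzz, or is the failed fetch the real reason for "Total hits: 0"?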