I am trying to test the integrity of the crawl before setting up Tomcat.  I am 
using Nutch 1.0 on Ubuntu 9.04.  Could someone tell me why I am getting 
"Total hits: 0"?  

Regards, 
Sanjeev 



Here are the configuration files I have edited.  
conf/crawl-urlfilter.txt 
+^http://([a-z0-9]*\.)*apache.org/
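For context, the rest of the file keeps the stock skip rules; a sketch of the whole filter as I understand the Nutch 1.0 default (comments included), with only the accept line above changed, would be:

```
# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in apache.org
+^http://([a-z0-9]*\.)*apache.org/

# skip everything else
-.
```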
conf/nutch-site.xml 
  <name>http.agent.name</name>
  <value>buzz</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty - 
    please set this to a single word uniquely related to your organization.
  </description>
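In full, the file uses the standard Hadoop-style configuration XML; a sketch of my complete nutch-site.xml, with only this one property set, is:

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>buzz</value>
    <description>HTTP 'User-Agent' request header. MUST NOT be empty -
      please set this to a single word uniquely related to your organization.
    </description>
  </property>
</configuration>
```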

Crawling the website: 
sanj...@tulip:~/nutch/nutch-1.0$ bin/nutch crawl urls -dir buzz -depth 1 -topN 1
crawl started in: buzz
rootUrlDir = urls
threads = 10
depth = 1
topN = 1
Injector: starting
Injector: crawlDb: buzz/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: buzz/segments/20090923194930
Generator: filtering: true
Generator: topN: 1
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting
Fetcher: segment: buzz/segments/20090923194930
Fetcher: threads: 10
QueueFeeder finished: total 1 records.
-finishing thread FetcherThread, activeThreads=1
fetching http://wiki.apache.org/tomcat/WordIndex
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
fetch of http://wiki.apache.org/tomcat/WordIndex failed with: java.net.SocketTimeoutException: Read timed out
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: buzz/crawldb
CrawlDb update: segments: [buzz/segments/20090923194930]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: buzz/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/home/sanjeev/nutch/nutch-1.0/buzz/segments/20090922223037
LinkDb: adding segment: file:/home/sanjeev/nutch/nutch-1.0/buzz/segments/20090923194930
LinkDb: adding segment: file:/home/sanjeev/nutch/nutch-1.0/buzz/segments/20090922213851
LinkDb: adding segment: file:/home/sanjeev/nutch/nutch-1.0/buzz/segments/20090922212629
LinkDb: adding segment: file:/home/sanjeev/nutch/nutch-1.0/buzz/segments/20090923152048
LinkDb: adding segment: file:/home/sanjeev/nutch/nutch-1.0/buzz/segments/20090922212647
LinkDb: adding segment: file:/home/sanjeev/nutch/nutch-1.0/buzz/segments/20090923191524
LinkDb: adding segment: file:/home/sanjeev/nutch/nutch-1.0/buzz/segments/20090923141246
LinkDb: adding segment: file:/home/sanjeev/nutch/nutch-1.0/buzz/segments/20090922213008
LinkDb: merging with existing linkdb: buzz/linkdb
LinkDb: done
Deleting old indexes: buzz/indexes
Deleting old merged index: buzz/index
Indexer: starting
Indexer: done
Dedup: starting
Dedup: adding indexes in: buzz/indexes
Dedup: done
merging indexes to: buzz/index
Adding file:/home/sanjeev/nutch/nutch-1.0/buzz/indexes/part-00000
done merging
crawl finished: buzz
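About the Fetcher warning near the top of the log: it seems to want the agent name listed first in the http.robots.agents property. The property name comes from the warning itself; the comma-separated value format below is my assumption about how it should look in nutch-site.xml:

```xml
<property>
  <name>http.robots.agents</name>
  <value>buzz,*</value>
  <description>Agent strings checked against robots.txt;
    the http.agent.name value should come first.</description>
</property>
```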

Testing the Crawl:
sanj...@tulip:~/nutch/nutch-1.0$ bin/nutch org.apache.nutch.searcher.NutchBean apache
Total hits: 0
sanj...@tulip:~/nutch/nutch-1.0$ 
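In case it helps with diagnosis, the crawldb can also be inspected with the readdb command; a sketch, assuming the buzz directory layout from the crawl above (I have not pasted the output here):

```
# dump crawldb statistics - db_fetched should be > 0 if anything was fetched
bin/nutch readdb buzz/crawldb -stats

# dump the individual URLs and their fetch status
bin/nutch readdb buzz/crawldb -dump crawldb-dump
less crawldb-dump/part-00000
```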