I am new to Nutch and am trying to find out whether we can use it to
provide web search functionality.
I am running the site on my local box on a WebLogic server. I am using
Nutch 0.8.1 on Windows XP under Cygwin.
I created a "urls" directory and then created a file called "frontend" in
that directory.
The local URL that I have specified in that file is
http://172.16.10.99:7001/frontend/
This is the only line in that file.
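For reference, the seed setup described above can be sketched in a few lines of Python. This is only an illustration of the layout (a "urls" directory holding one file with one URL per line); the helper name `write_seed_file` and its parameters are hypothetical and are not part of Nutch.

```python
from pathlib import Path


def write_seed_file(root, urls, name="frontend"):
    """Create <root>/urls/<name> with one URL per line and return its path.

    Hypothetical helper: it only mirrors the directory layout described
    in the message above.
    """
    seed_dir = Path(root) / "urls"
    seed_dir.mkdir(parents=True, exist_ok=True)
    seed_file = seed_dir / name
    # newline="\n" forces LF endings so no carriage returns end up in the file
    with open(seed_file, "w", encoding="utf-8", newline="\n") as handle:
        for url in urls:
            handle.write(url + "\n")
    return seed_file


if __name__ == "__main__":
    path = write_seed_file(".", ["http://172.16.10.99:7001/frontend/"])
    print(path)
```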
I have also changed the crawl-urlfilter file as follows:
# accept hosts in MY.DOMAIN.NAME
+^http://172.16.10.99:7001/frontend/
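As a sanity check on the rule above, here is a minimal Python sketch of how a list of "+"/"-" prefixed regex rules can be evaluated. It assumes that the first matching rule decides and that a URL matching no rule is rejected, and it uses Python's `re.search` as a stand-in for the filter's own regex engine, so edge cases may differ. The function names `parse_rules` and `accepts` are hypothetical.

```python
import re

# Only the two lines quoted in the message are used here.
RULES_TEXT = """\
# accept hosts in MY.DOMAIN.NAME
+^http://172.16.10.99:7001/frontend/
"""


def parse_rules(text):
    """Turn filter text into a list of (allow, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """Assumed semantics: first matching rule wins, no match means reject."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


if __name__ == "__main__":
    rules = parse_rules(RULES_TEXT)
    print(accepts("http://172.16.10.99:7001/frontend/", rules))
    print(accepts("http://example.com/", rules))
```

Under these assumptions the seed URL passes the rule, which is consistent with the fetcher attempting it in the output below. Note that the dots in the pattern are unescaped, so each one matches any character; writing them as `\.` would make the rule stricter.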
The command I am executing is:
bin/nutch crawl urls -dir _crawloutput -depth 3 -topN 50
The crawl output I get is as follows:
crawl started in: _crawloutput
rootUrlDir = urls
threads = 10
depth = 3
topN = 50
Injector: starting
Injector: crawlDb: _crawloutput/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: starting
Generator: segment: _crawloutput/segments/20060929101916
Generator: Selecting best-scoring urls due for fetch.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: _crawloutput/segments/20060929101916
Fetcher: threads: 10
fetching http://172.16.10.99:7001/frontend/
fetch of http://172.16.10.99:7001/frontend/ failed with: java.lang.NullPointerException
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: _crawloutput/crawldb
CrawlDb update: segment: _crawloutput/segments/20060929101916
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: starting
Generator: segment: _crawloutput/segments/20060929101924
Generator: Selecting best-scoring urls due for fetch.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: _crawloutput/segments/20060929101924
Fetcher: threads: 10
fetching http://172.16.10.99:7001/frontend/
fetch of http://172.16.10.99:7001/frontend/ failed with: java.lang.NullPointerException
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: _crawloutput/crawldb
CrawlDb update: segment: _crawloutput/segments/20060929101924
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: starting
Generator: segment: _crawloutput/segments/20060929101932
Generator: Selecting best-scoring urls due for fetch.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: _crawloutput/segments/20060929101932
Fetcher: threads: 10
fetching http://172.16.10.99:7001/frontend/
fetch of http://172.16.10.99:7001/frontend/ failed with: java.lang.NullPointerException
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: _crawloutput/crawldb
CrawlDb update: segment: _crawloutput/segments/20060929101932
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: _crawloutput/linkdb
LinkDb: adding segment: _crawloutput/segments/20060929101916
LinkDb: adding segment: _crawloutput/segments/20060929101924
LinkDb: adding segment: _crawloutput/segments/20060929101932
LinkDb: done
Indexer: starting
Indexer: linkdb: _crawloutput/linkdb
Indexer: adding segment: _crawloutput/segments/20060929101916
Indexer: adding segment: _crawloutput/segments/20060929101924
Indexer: adding segment: _crawloutput/segments/20060929101932
Optimizing index.
Indexer: done
Dedup: starting
Dedup: adding indexes in: _crawloutput/indexes
Dedup: done
Adding _crawloutput/indexes/part-00000
crawl finished: _crawloutput
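Because the crawl reports "finished" even though every fetch failed, it can help to pull the failures out of the output. The following is a rough Python sketch that counts "fetch of ... failed with: ..." entries in captured crawl output. The sample text is an excerpt of the log above; the name `summarize_failures` is hypothetical, and the pattern tolerates the message being wrapped across lines.

```python
import re
from collections import Counter

# Excerpt of the crawl output shown above.
SAMPLE_LOG = """\
Fetcher: starting
fetching http://172.16.10.99:7001/frontend/
fetch of http://172.16.10.99:7001/frontend/ failed with:
java.lang.NullPointerException
Fetcher: done
"""

# URL, an optional "<url>" echo added by some mail clients, then the error.
FAILURE = re.compile(r"fetch of\s+(\S+)(?:\s+<\S+>)?\s+failed with:\s+(\S+)")


def summarize_failures(log_text):
    """Count (url, error) pairs reported as failed fetches."""
    return Counter(FAILURE.findall(log_text))


if __name__ == "__main__":
    for (url, error), count in summarize_failures(SAMPLE_LOG).items():
        print(f"{count}x {url}: {error}")
```

Run against the full output above, this would report the same URL failing with java.lang.NullPointerException once per fetch round, that is, three times for depth 3.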
I am not sure what I am doing wrong. Can someone help?
Thanks
Anand Narayan
_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general