Hi, Anand.

You wrote on 29 September 2006, 18:22:39:

> I am new to Nutch and am trying to see if we can use it for web search
> functionality.
> I am running the site on my local box on a WebLogic server. I am using
> Nutch 0.8.1 on Windows XP under Cygwin.

> I created a "urls" directory and then created a file called "frontend" in
> that directory.
> The local URL that I have specified in that file is
> http://172.16.10.99:7001/frontend/
> This is the only line in that file.

> I have also changed the crawl-urlfilter file as follows:
> # accept hosts in MY.DOMAIN.NAME

> +^http://172.16.10.99:7001/frontend/
That line is the problem: remove it from the file. Then copy the URL
from your "frontend" seed file into the crawl-urlfilter file directly
after the "# accept hosts in MY.DOMAIN.NAME" comment.

Also, at the end of the file, replace the catch-all "+." rule with "-."
so that anything not explicitly accepted is rejected.
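After those two edits, the relevant part of conf/crawl-urlfilter.txt
should look roughly like this (a sketch only; the host and port come
from your seed file, and each rule is a regular expression prefixed
with + to accept or - to reject):

  # accept hosts in MY.DOMAIN.NAME
  +http://172.16.10.99:7001/frontend/

  # reject anything else
  -.

With "-." as the last rule, any URL that does not match an earlier "+"
pattern is dropped instead of being fetched.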


> The command I am executing is:
> bin/nutch crawl urls -dir _crawloutput -depth 3 -topN 50

> The crawl output I get is as follows:
> crawl started in: _crawloutput
> rootUrlDir = urls
> threads = 10
> depth = 3
> topN = 50
> Injector: starting
> Injector: crawlDb: _crawloutput/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: starting
> Generator: segment: _crawloutput/segments/20060929101916
> Generator: Selecting best-scoring urls due for fetch.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: _crawloutput/segments/20060929101916
> Fetcher: threads: 10
> fetching http://172.16.10.99:7001/frontend/
> fetch of http://172.16.10.99:7001/frontend/ failed with:
> java.lang.NullPointerException
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: _crawloutput/crawldb
> CrawlDb update: segment: _crawloutput/segments/20060929101916
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: starting
> Generator: segment: _crawloutput/segments/20060929101924
> Generator: Selecting best-scoring urls due for fetch.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: _crawloutput/segments/20060929101924
> Fetcher: threads: 10
> fetching http://172.16.10.99:7001/frontend/
> fetch of http://172.16.10.99:7001/frontend/ failed with:
> java.lang.NullPointerException
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: _crawloutput/crawldb
> CrawlDb update: segment: _crawloutput/segments/20060929101924
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: starting
> Generator: segment: _crawloutput/segments/20060929101932
> Generator: Selecting best-scoring urls due for fetch.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: _crawloutput/segments/20060929101932
> Fetcher: threads: 10
> fetching http://172.16.10.99:7001/frontend/
> fetch of http://172.16.10.99:7001/frontend/ failed with:
> java.lang.NullPointerException
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: _crawloutput/crawldb
> CrawlDb update: segment: _crawloutput/segments/20060929101932
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> LinkDb: starting
> LinkDb: linkdb: _crawloutput/linkdb
> LinkDb: adding segment: _crawloutput/segments/20060929101916
> LinkDb: adding segment: _crawloutput/segments/20060929101924
> LinkDb: adding segment: _crawloutput/segments/20060929101932
> LinkDb: done
> Indexer: starting
> Indexer: linkdb: _crawloutput/linkdb
> Indexer: adding segment: _crawloutput/segments/20060929101916
> Indexer: adding segment: _crawloutput/segments/20060929101924
> Indexer: adding segment: _crawloutput/segments/20060929101932
> Optimizing index.
> Indexer: done
> Dedup: starting
> Dedup: adding indexes in: _crawloutput/indexes
> Dedup: done
> Adding _crawloutput/indexes/part-00000
> crawl finished: _crawloutput

> I am not sure what I am doing wrong. Can someone help?

> Thanks
> Anand Narayan


-- 
Regards,
 Dima                          mailto:[EMAIL PROTECTED]

