Jason,

The data is technically not stored in the WebDB -- just the links are (and the related information). The pages (content) are stored within the segments.
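If you want to see what the WebDB actually holds, something along these lines will walk every page record in it. This is a rough, untested sketch against the 0.6-era net.nutch API -- the package names, the NutchFileSystem.get() call, and the WebDBReader method names here are from memory, so verify them against the javadoc for the version you're running (the class names DumpWebDb etc. are just mine):

import java.io.File;
import java.util.Enumeration;

import net.nutch.db.Page;
import net.nutch.db.WebDBReader;
import net.nutch.fs.NutchFileSystem;

public class DumpWebDb {
  public static void main(String[] args) throws Exception {
    NutchFileSystem fs = NutchFileSystem.get();   // assumed: default (local) filesystem
    // args[0] = path to the db directory
    WebDBReader reader = new WebDBReader(fs, new File(args[0]));
    try {
      // Walk every Page record: the URL plus bookkeeping (MD5 of content, etc.).
      for (Enumeration e = reader.pages(); e.hasMoreElements();) {
        Page page = (Page) e.nextElement();
        System.out.println(page.getURL() + "\t" + page.getMD5());
      }
      // reader.links() enumerates the Link records (the link graph) the same way.
    } finally {
      reader.close();
    }
  }
}

Note that none of the fetched page content is reachable this way -- for that you need to read the segments, as described below.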
Nutch is nothing more than a "search engine"-specific implementation of Lucene (with added hooks), and it uses Lucene for indexing out of the box. Thus you could, in theory, use the entire Nutch process and then do your searching with dotLucene against the Lucene indexes Nutch builds over the segments it creates (I repeat, "in theory" -- you'll need to do some hacking to make it work). Make sure the index file-format versions are the same or compatible.

Look at the WebDB API and you'll get a better picture of what's available. You should also look at SegmentReader, as this is what you'll need to read the actual content of each page -- see the sketch at the end of this message.

CC

--------------------------------------------
Filangy, Inc.
Interested in Improving Search? Join our Team!
http://filangy.com/jointheteam.jsp

-----Original Message-----
From: Jason Manfield [mailto:[EMAIL PROTECTED]
Sent: Monday, May 02, 2005 3:32 PM
To: Nutch User
Subject: using nutch just for crawling, not indexing?

We would like to use Nutch just for crawling, and then index the crawled data into our proprietary datastore/index. How do we go about this? I see that Nutch is driven by a shell script, so it should be possible to run just the crawl. Once it crawls, I suppose the crawled data is dumped into the WebDB. Are there exposed APIs to extract the data from the WebDB?

One more catch -- our company is a .NET shop :((, so we would like to use C# to read the data of the fetched/crawled pages for further indexing. Ideas/suggestions? Any plans to have Nutch for .NET (like dotLucene)?
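P.S. Here is the SegmentReader sketch mentioned above. Again, this is untested and written against the 0.6-era net.nutch API from memory -- treat the constructor, the next(...) signature, and the accessor names as assumptions to check against the SegmentReader source in your checkout:

import java.io.File;

import net.nutch.fetcher.FetcherOutput;
import net.nutch.fs.NutchFileSystem;
import net.nutch.parse.ParseData;
import net.nutch.parse.ParseText;
import net.nutch.protocol.Content;
import net.nutch.segment.SegmentReader;

public class DumpSegment {
  public static void main(String[] args) throws Exception {
    NutchFileSystem fs = NutchFileSystem.get();   // assumed: default filesystem
    // args[0] = path to a segment directory
    SegmentReader reader = new SegmentReader(fs, new File(args[0]));
    try {
      FetcherOutput fo = new FetcherOutput();  // fetch status and bookkeeping
      Content content = new Content();         // raw bytes plus protocol headers
      ParseText text = new ParseText();        // extracted plain text
      ParseData data = new ParseData();        // title, outlinks, metadata
      // Advance through the segment entry by entry until exhausted.
      while (reader.next(fo, content, text, data)) {
        System.out.println(content.getUrl());
        System.out.println(text.getText());
      }
    } finally {
      reader.close();
    }
  }
}

Since there is no .NET port of Nutch itself, the path of least resistance for your C# indexer is probably a small Java exporter along these lines that dumps each page's URL and text to a neutral format your .NET code can consume, rather than trying to parse the segment files directly from C#.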
