Jason,

The data is technically not stored in the WebDB -- just the links are (and the related information). The pages (content) are stored within the segments.
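If you want to see what the WebDB actually holds, something along these lines will walk every page record in it. This is a rough, untested sketch against the 0.6-era net.nutch API -- the package names, the NutchFileSystem.get() call, and the WebDBReader method names here are from memory, so verify them against the javadoc for the version you're running (the class names DumpWebDb etc. are just mine):

import java.io.File;
import java.util.Enumeration;

import net.nutch.db.Page;
import net.nutch.db.WebDBReader;
import net.nutch.fs.NutchFileSystem;

public class DumpWebDb {
  public static void main(String[] args) throws Exception {
    NutchFileSystem fs = NutchFileSystem.get();   // assumed: default (local) filesystem
    // args[0] = path to the db directory
    WebDBReader reader = new WebDBReader(fs, new File(args[0]));
    try {
      // Walk every Page record: the URL plus bookkeeping (MD5 of content, etc.).
      for (Enumeration e = reader.pages(); e.hasMoreElements();) {
        Page page = (Page) e.nextElement();
        System.out.println(page.getURL() + "\t" + page.getMD5());
      }
      // reader.links() enumerates the Link records (the link graph) the same way.
    } finally {
      reader.close();
    }
  }
}

Note that none of the fetched page content is reachable this way -- for that you need to read the segments, as described below.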
Nutch is nothing more than a "search engine"-specific implementation of Lucene (with added hooks), and it uses Lucene for indexing out of the box. Thus you could, in theory, use the entire Nutch process and then do your searching with dotLucene against the Lucene indexes Nutch builds over the segments it creates (I repeat, "in theory" -- you'll need to do some hacking to make it work). Make sure the index file-format versions are the same or compatible.

Look at the WebDB API and you'll get a better picture of what's available. You should also look at SegmentReader, as this is what you'll need to read the actual content of each page -- see the sketch at the end of this message.

CC

--------------------------------------------
Filangy, Inc.
Interested in Improving Search? Join our Team!
http://filangy.com/jointheteam.jsp

-----Original Message-----
From: Jason Manfield [mailto:[EMAIL PROTECTED]
Sent: Monday, May 02, 2005 3:32 PM
To: Nutch User
Subject: using nutch just for crawling, not indexing?

We would like to use Nutch just for crawling, and then index the crawled data into our proprietary datastore/index. How do we go about this? I see that Nutch is driven by a shell script, so it should be possible to run just the crawl. Once it crawls, I suppose the crawled data is dumped into the WebDB. Are there exposed APIs to extract the data from the WebDB?

One more catch -- our company is a .NET shop :((, so we would like to use C# to read the data of the fetched/crawled pages for further indexing. Ideas/suggestions? Any plans to have Nutch for .NET (like dotLucene)?
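P.S. Here is the SegmentReader sketch mentioned above. Again, this is untested and written against the 0.6-era net.nutch API from memory -- treat the constructor, the next(...) signature, and the accessor names as assumptions to check against the SegmentReader source in your checkout:

import java.io.File;

import net.nutch.fetcher.FetcherOutput;
import net.nutch.fs.NutchFileSystem;
import net.nutch.parse.ParseData;
import net.nutch.parse.ParseText;
import net.nutch.protocol.Content;
import net.nutch.segment.SegmentReader;

public class DumpSegment {
  public static void main(String[] args) throws Exception {
    NutchFileSystem fs = NutchFileSystem.get();   // assumed: default filesystem
    // args[0] = path to a segment directory
    SegmentReader reader = new SegmentReader(fs, new File(args[0]));
    try {
      FetcherOutput fo = new FetcherOutput();  // fetch status and bookkeeping
      Content content = new Content();         // raw bytes plus protocol headers
      ParseText text = new ParseText();        // extracted plain text
      ParseData data = new ParseData();        // title, outlinks, metadata
      // Advance through the segment entry by entry until exhausted.
      while (reader.next(fo, content, text, data)) {
        System.out.println(content.getUrl());
        System.out.println(text.getText());
      }
    } finally {
      reader.close();
    }
  }
}

Since there is no .NET port of Nutch itself, the path of least resistance for your C# indexer is probably a small Java exporter along these lines that dumps each page's URL and text to a neutral format your .NET code can consume, rather than trying to parse the segment files directly from C#.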
