Hi Cool Coder,

I also wrote a crawler based on a previous version of Nutch (0.7) for intranet crawling. In my crawler, I used a MySQL database as the backend storage for URLs together with their metadata and content.
But since maintaining my own crawler has almost become a full-time job in itself, and it lacks many features compared to the latest version of Nutch, I am motivated to come back to Nutch and am thinking about using it instead of my own. However, I have different indexing needs, so I don't think I will use the Nutch indexing and search code; I would rather stick to my own. Like you, I use a customized version of Lucene for indexing and search. I just hope to make Nutch easier to use for this kind of crawl-only scenario. I am willing to contribute code and documentation toward this, if necessary. To make the two proposals from my earlier mail (quoted below) more concrete, I have put rough sketches after the quoted thread.

Cheers,
Jian

On Nov 23, 2007 5:51 PM, Cool Coder <[EMAIL PROTECTED]> wrote:
> Definitely +1 from me.
> In fact, I am already looking for a crawler to replace my own junk :(
> crawler. I am currently using my own crawler and Lucene.
>
> - RB
>
>
> jian chen <[EMAIL PROTECTED]> wrote:
> Hi, Nutch Developers,
>
> We would like to use Nutch as a crawler but do not need the indexes.
> Looking at Crawl.java, which is used for intranet crawling, it is easy
> to just comment out the index, dedup, and merge code.
>
> I see that a lot of people on this mailing list want to use Nutch only
> for crawling, so how about providing a command switch in Crawl.java to
> support crawl-only mode?
>
> Another, semi-related proposal: since Nutch would be used only for
> crawling, we need to extract the data out of Nutch after the crawl.
> This is also a recurring question on this mailing list. After doing
> some research, I found a couple of utility classes that could be used
> to extract the content from the crawled database, for example the
> SequenceFileReader and/or the MapFileReader.
>
> So, how about providing a simple class that just dumps the data out?
> I know the SegmentReader and/or CrawlDbReader could be used, but they
> are somewhat "heavy" because they run Map/Reduce jobs to do the work.
> In quite a few cases, I believe people just need to dump the crawl
> data on the local file system into a text file or a database, so a
> tool that does not use Map/Reduce might be better.
>
> I know Nutch is designed for internet-scale crawling, but there is a
> real need for intranet crawling, where the data should be easy to
> manipulate and a single machine is good enough.
>
> Thanks,
>
> Jian
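
Sketch 1: the crawl-only switch. This is NOT the actual Crawl.java code, just the shape of what I mean. The flag name "-noIndex", the class name, and the two step methods are hypothetical placeholders standing in for the existing inject/generate/fetch/updatedb loop and the index/dedup/merge calls in Crawl.java's main():

// Minimal sketch of a crawl-only switch for Crawl.java (placeholders only).
public class CrawlOnlySketch {

  public static void main(String[] args) {
    boolean indexAfterCrawl = true;

    // Parse a hypothetical "-noIndex" switch so the indexing phase can be skipped.
    for (String arg : args) {
      if ("-noIndex".equals(arg)) {
        indexAfterCrawl = false;
      }
    }

    // Crawl phase: always runs.
    crawl();

    // Indexing phase: only runs when the switch is absent.
    if (indexAfterCrawl) {
      indexDedupAndMerge();
    }
  }

  // Placeholders standing in for the real Nutch job invocations.
  private static void crawl() {
    System.out.println("inject -> generate/fetch/parse/updatedb loop");
  }

  private static void indexDedupAndMerge() {
    System.out.println("invertlinks -> index -> dedup -> merge");
  }
}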

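Sketch 2: the kind of stand-alone dump tool I had in mind. It only uses the generic Hadoop SequenceFile reader API (the "data" files inside a MapFile directory are plain SequenceFiles as well), so the same loop should work for crawldb and segment data alike. The class name DumpSegmentData is my own invention, not an existing Nutch tool, and this is only a sketch, not a finished patch:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Dumps every key/value pair of one sequence file to stdout, without Map/Reduce.
public class DumpSegmentData {

  public static void main(String[] args) throws IOException {
    if (args.length != 1) {
      System.err.println("Usage: DumpSegmentData <path-to-sequence-file>");
      return;
    }

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(args[0]), conf);
    try {
      // Instantiate key/value objects of whatever Writable types the file declares
      // (e.g. Text/CrawlDatum for the crawldb, Text/Content for segment content).
      Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);

      while (reader.next(key, value)) {
        System.out.println(key + "\t" + value);
      }
    } finally {
      reader.close();
    }
  }
}

With the Nutch/Hadoop jars on the classpath, I would expect it to be pointed at something like crawl/crawldb/current/part-00000/data or crawl/segments/<timestamp>/content/part-00000/data (exact layout depending on the Nutch version). Redirecting stdout to a file, or replacing the println with an INSERT statement, would cover the text-file and database cases people keep asking about.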