Hi, Nutch developers,

We would like to use Nutch as a crawler, but we do not need the indexes. Looking at Crawl.java, which is used for intranet crawls, it is easy enough to comment out the indexing, dedup, and merge code.
I see on this mailing list that a lot of people want to use Nutch only for crawling, so how about providing a command-line switch in Crawl.java to support a crawl-only mode?

Another, semi-related proposal: since Nutch would be used only for crawling, we need a way to extract the data out of Nutch after the crawl. This is also a recurring question on this list. After doing quite a bit of research, I found that there are a couple of utility classes that could be used to extract the content from the crawl database, for example SequenceFile.Reader and/or MapFile.Reader. So, how about providing a simple class that just dumps the data out? I know that SegmentReader and/or CrawlDbReader could be used, but they are somewhat "heavy" because they run Map/Reduce jobs to do the work. In quite a few cases, I believe people just need to dump the crawl data on the local file system into a text file or a database, so a tool that does not use Map/Reduce might be better.

I know Nutch is designed for internet-scale crawls, but there is a real need to run intranet crawls and easily manipulate the resulting data, for which one machine is good enough.

Thanks,
Jian
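For what it's worth, here is a rough sketch of the kind of dumper I have in mind: it reads a segment's fetched content directly with a plain SequenceFile.Reader, with no Map/Reduce job involved. This is untested pseudocode-quality Java; the hard-coded "part-00000" path is an assumption (a real tool would list all part directories), and a proper version would use NutchConfiguration.create() instead of a bare Configuration.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;

/** Dumps the fetched content of one segment to stdout, without Map/Reduce. */
public class SegmentContentDumper {
  public static void main(String[] args) throws IOException {
    if (args.length != 1) {
      System.err.println("Usage: SegmentContentDumper <segment dir>");
      return;
    }
    Configuration conf = new Configuration(); // NutchConfiguration.create() in Nutch proper
    FileSystem fs = FileSystem.get(conf);

    // Fetched content lives under <segment>/content/part-NNNNN/data; the
    // MapFile's data file is itself a SequenceFile of <Text url, Content>.
    // Assumes a single part-00000 for simplicity.
    Path segment = new Path(args[0]);
    Path data = new Path(new Path(segment, Content.DIR_NAME), "part-00000/data");

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    try {
      Text url = new Text();
      Content content = new Content();
      while (reader.next(url, content)) {
        System.out.println("URL: " + url);
        System.out.println(new String(content.getContent())); // raw page bytes
      }
    } finally {
      reader.close();
    }
  }
}
```

Something this small could be dropped in as a starting point and extended to write to a local file or a JDBC connection instead of stdout.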
