Hi, Nutch developers,

We would like to use Nutch as a crawler, but we do not need the indexes. Looking at Crawl.java, which is used for intranet crawls, it is easy enough to comment out the indexing, dedup, and merge code.
I see on this mailing list that a lot of people want to use Nutch only for crawling, so how about providing a command-line switch in Crawl.java to support a crawl-only mode?

Another, semi-related proposal: since Nutch would be used only for crawling, we need a way to extract the data out of Nutch after the crawl. This is also a recurring question on this list. After doing quite a bit of research, I found that there are a couple of utility classes that could be used to extract the content from the crawl database, for example SequenceFile.Reader and/or MapFile.Reader. So, how about providing a simple class that just dumps the data out? I know that SegmentReader and/or CrawlDbReader could be used, but they are somewhat "heavy" because they run Map/Reduce jobs to do the work. In quite a few cases, I believe people just need to dump the crawl data on the local file system into a text file or a database, so a tool that does not use Map/Reduce might be better.

I know Nutch is designed for internet-scale crawls, but there is a real need to run intranet crawls and easily manipulate the resulting data, for which one machine is good enough.

Thanks,
Jian
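For what it's worth, here is a rough sketch of the kind of dumper I have in mind: it reads a segment's fetched content directly with a plain SequenceFile.Reader, with no Map/Reduce job involved. This is untested pseudocode-quality Java; the hard-coded "part-00000" path is an assumption (a real tool would list all part directories), and a proper version would use NutchConfiguration.create() instead of a bare Configuration.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;

/** Dumps the fetched content of one segment to stdout, without Map/Reduce. */
public class SegmentContentDumper {
  public static void main(String[] args) throws IOException {
    if (args.length != 1) {
      System.err.println("Usage: SegmentContentDumper <segment dir>");
      return;
    }
    Configuration conf = new Configuration(); // NutchConfiguration.create() in Nutch proper
    FileSystem fs = FileSystem.get(conf);

    // Fetched content lives under <segment>/content/part-NNNNN/data; the
    // MapFile's data file is itself a SequenceFile of <Text url, Content>.
    // Assumes a single part-00000 for simplicity.
    Path segment = new Path(args[0]);
    Path data = new Path(new Path(segment, Content.DIR_NAME), "part-00000/data");

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    try {
      Text url = new Text();
      Content content = new Content();
      while (reader.next(url, content)) {
        System.out.println("URL: " + url);
        System.out.println(new String(content.getContent())); // raw page bytes
      }
    } finally {
      reader.close();
    }
  }
}
```

Something this small could be dropped in as a starting point and extended to write to a local file or a JDBC connection instead of stdout.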
