Definitely +1 from me.
In fact, I am already looking for a crawler to replace my own junk :( crawler.
I am currently using my own crawler and lucene.
- RB
jian chen <[EMAIL PROTECTED]> wrote:
Hi, Nutch Developers,
We would like to use Nutch as a crawler but do not need the indexes.
Looking at Crawl.java, which is used for intranet crawls, it is
easy to comment out the indexing, dedup, and merge code.
I see on this mailing list that a lot of people want to use Nutch only
for crawling, so how about providing a command-line switch in
Crawl.java to support a crawl-only mode?
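For what it's worth, the switch could be as simple as checking one extra argument before the post-crawl steps. A minimal sketch of the argument handling (the "-noIndex" flag name is just an illustration, not an existing Nutch option, and the crawl logic itself is elided):

```java
// Sketch of a hypothetical crawl-only switch for Crawl.java.
// The "-noIndex" flag is my own invention, not an existing Nutch option;
// this shows only the argument parsing, not the actual crawl steps.
public class CrawlOnlySwitch {

    // Returns false when the hypothetical -noIndex flag is present.
    static boolean shouldIndex(String[] args) {
        for (String arg : args) {
            if ("-noIndex".equals(arg)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // After fetching/updating the crawldb, Crawl.java could guard the
        // index, dedup, and merge steps behind this check.
        if (shouldIndex(args)) {
            System.out.println("would run index, dedup, and merge steps");
        } else {
            System.out.println("crawl only: skipping index, dedup, and merge");
        }
    }
}
```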
Another, semi-related proposal. Since Nutch would be used only for
crawling, we need to extract the data out of Nutch after the crawl.
This is also a recurring question on this mailing list. After doing
quite a bit of research, I found that there are a couple of utility
classes that could be used to extract the content from the crawled
database, for example SequenceFile.Reader and/or MapFile.Reader.
So, how about providing a simple class that just dumps the data out?
I know that SegmentReader and/or CrawlDbReader could be used, but they
are somewhat "heavy" since they run as Map/Reduce jobs. In many cases,
I believe people just need to dump the crawl data on the local file
system into a text file or a database, so a tool that does not use
Map/Reduce might be better?
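To make the idea concrete, here is a rough sketch of what such a dump tool might look like, assuming Hadoop's SequenceFile.Reader API. The class name, the path argument, and the tab-separated output format are my own assumptions, and this has not been tested against a real crawl:

```java
// Rough sketch of a non-Map/Reduce dump tool using Hadoop's
// SequenceFile.Reader directly. The path passed in (e.g. a segment's
// part-00000/data file) and the output format are placeholders.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SegmentDump {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.getLocal(conf);
        Path data = new Path(args[0]); // e.g. <segment>/content/part-00000/data

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
        try {
            // Instantiate key/value holders from the classes the file records.
            Writable key = (Writable)
                ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable)
                ReflectionUtils.newInstance(reader.getValueClass(), conf);
            // Dump every record as tab-separated text on stdout.
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        } finally {
            reader.close();
        }
    }
}
```

Something this small runs on one machine with no job tracker involved, which is all that the intranet case seems to need.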
I know Nutch is designed for internet-scale crawls, but there is also
a real need to do intranet crawls and easily manipulate the data, for
which one machine is good enough.
Thanks,
Jian