Re: Integration of Nutch

Renaud Richardet Mon, 06 Aug 2007 18:40:51 -0700

Hi Marcus,

Hi.


I am building (yet another) crawler, then parsing the crawled HTML files and
indexing them with Lucene. Then I came to think about it. Stupido! Why aren't
you using Nutch instead?

My use case is something like this.

100-1000 domains with an average depth of 3 to 5, I think. If I miss some
pages it is not the end of the world, so trading depth for crawl speed is
acceptable.
All URLs must be crawled at least once a day, scheduled via crontab.

I would like to have one Lucene dir which is optimized after each reindexing,
not one dir per crawl, so I need to create something like the recrawl script
which is published on the Wiki.

Not sure I understand: why don't you just throw away the old index once
you have successfully created the new one (since you have to re-crawl
the whole content daily)?
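
For what it's worth, here is a rough sketch of what I mean by throwing away
the old index only after the new one is complete. It is plain JDK code, not
part of Nutch or Lucene; the class name IndexSwap, the ".old" suffix and the
directory layout are made up for illustration, and it assumes a JDK with
java.nio.file and that both directories sit on the same filesystem.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: swaps a freshly built index directory into place.
final class IndexSwap {

    /** Replaces liveIndex with freshIndex, then deletes the retired copy. */
    static void swap(Path liveIndex, Path freshIndex) throws IOException {
        // Assumed naming scheme: the old index is parked next to the live one.
        Path retired = liveIndex.resolveSibling(liveIndex.getFileName() + ".old");
        deleteRecursively(retired); // clear leftovers from an interrupted run

        if (Files.exists(liveIndex)) {
            // A rename on the same filesystem; fails if the move cannot be atomic.
            Files.move(liveIndex, retired, StandardCopyOption.ATOMIC_MOVE);
        }
        Files.move(freshIndex, liveIndex, StandardCopyOption.ATOMIC_MOVE);
        deleteRecursively(retired);
    }

    /** Deletes a directory tree; does nothing if the path does not exist. */
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> deepestFirst;
        try (Stream<Path> paths = Files.walk(dir)) {
            // Reverse order so that children are deleted before their parents.
            deepestFirst = paths.sorted(Comparator.reverseOrder())
                                .collect(Collectors.toList());
        }
        for (Path p : deepestFirst) {
            Files.delete(p);
        }
    }
}
```

You would call something like IndexSwap.swap(Paths.get("index"),
Paths.get("index.new")) at the end of the daily cron job, and only if the
crawl and indexing steps finished without errors. Note that there is a short
moment between the two renames where no live index exists, so the searcher
should be reopened after the swap.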

I would prefer to search the content myself by creating an IndexSearcher.
This is because I already index a whole lot of RSS feeds, so I would like to
do a "MultiIndex" search, and I think that will be hard to do without doing
it yourself.

Or you could index the feeds with Nutch, too. There's a plugin for RSS...

I noticed the WAR file but I would prefer to create the templates myself.

Actually, the WAR is just a starting point; you will have to implement your
layout in the JSPs anyway.


HTH,
Renaud
