Hi. I am building (yet another) crawler, parsing and indexing the crawled HTML files with Lucene. Then I started to think about it. Stupido! Why aren't you using Nutch instead?
My use case is roughly this: 100-1000 domains with an average depth of 3 to 5, I think. If I miss some pages it is not the end of the world, so I am willing to trade some depth for crawl speed. All URLs must be crawled at least once a day, driven from cron.

I would like to end up with one Lucene directory which is optimized after each reindexing, not one directory per crawl, so I need to create something like the recrawl script published on the Wiki.

I would also prefer to search the content myself by creating an IndexSearcher. The reason is that I already index a whole lot of RSS feeds, so I would like to do a "MultiIndex" search across both, and I think that will be hard to do without rolling it yourself. I noticed the WAR file, but I would prefer to create the templates myself.

Does anyone have a good pattern for this?
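Roughly what I have in mind for the "MultiIndex" part is something like the sketch below, using Lucene's MultiSearcher (assuming the Lucene 2.x API that Nutch currently bundles; the index paths and the "content"/"url" field names are just placeholders for whatever the crawl and the RSS indexer actually produce):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;

public class MultiIndexSearch {

    public static void main(String[] args) throws Exception {
        // One searcher per Lucene directory: the recrawled/optimized web index
        // and the separately maintained RSS index (paths are placeholders).
        IndexSearcher webSearcher = new IndexSearcher("/data/index/web");
        IndexSearcher rssSearcher = new IndexSearcher("/data/index/rss");

        // MultiSearcher fans the query out over both indexes and merges the hits.
        MultiSearcher searcher =
            new MultiSearcher(new Searchable[] { webSearcher, rssSearcher });

        // "content" is assumed to be the full-text field in both indexes.
        QueryParser parser = new QueryParser("content", new StandardAnalyzer());
        Query query = parser.parse(args.length > 0 ? args[0] : "nutch");

        Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            System.out.println(hits.score(i) + "\t" + doc.get("url"));
        }

        searcher.close();
    }
}

An alternative would be wrapping the two directories in a MultiReader and handing that to a single IndexSearcher, if one searcher object is preferred.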
Kindly
//Marcus Herou

--
Marcus Herou
Solution Architect & Core Java developer
Tailsweep AB
+46702561312
[EMAIL PROTECTED]
http://www.tailsweep.com