Hi Ian,
I am very new to Nutch as well, so take this for what it is.
First, I have been putting all my regex filters into the crawl-urlfilter file and just leaving them there, even if I am only crawling a subset of sites, and I don't have any problems. Of course, if you have complicated regexes that might clash with each other, then you would have to add/remove them.
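For example, my crawl-urlfilter file is just an accumulated list of entries like the ones below (the domains are placeholders, not my real ones). Since the first matching pattern wins, I keep the suffix skips at the top and the catch-all reject at the bottom:

# skip image and archive suffixes
-\.(gif|jpg|png|zip|gz)$
# accept anything from the sites I care about
+^http://([a-z0-9]*\.)*apache.org/
+^http://([a-z0-9]*\.)*example.com/
# reject everything else
-.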
Second, have you looked at archive.org's Heritrix (http://crawler.archive.org/)? It does a lot of what you are talking about (sans the database). I am not suggesting you use it, as the crawl output is probably not what you want, but the program does an amazing number of things that might give you some (more) ideas. It might be worth your time just to check it out.
-lucas
On May 19, 2005, at 6:27 AM, Ian Reardon wrote:
Would this be useful to anyone? I am new to Nutch, but from what I've seen so far it would make my life easier.
I was thinking about having a small web interface that would run the crawler. This would be more for intranet crawling rather than whole-web. It would basically have a database of the URLs you want to crawl, along with properties associated with each site (regexes, depth, delays, etc.). It would then monitor the log file, keep track of the number of sites crawled, # of errors, # of successes and so on, and display it all in a nice layout/status page.
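Roughly, one row of that database would hold something like this (just a sketch in Java; the field names are made up, nothing Nutch defines):

import java.util.List;

// One entry in the proposed "sites" table -- everything the interface
// would need to set up and launch a crawl for that site.
public class CrawlSite {
    String name;           // label shown on the status page
    String startUrl;       // seed URL handed to the crawler
    List<String> filters;  // "+regex" / "-regex" lines for the urlfilter file
    int depth;             // depth to pass to the crawl command
    long delayMillis;      // politeness delay between fetches
}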
I find that I'll be adding and deleting regexes from the urlfilter files all the time based on the sites I'm crawling. It would be nice to have all of that organized in a database so that if I ever want to recrawl a site, I just hit two buttons, the right regexes and starting URL load up, and it just starts to crawl.
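The "two buttons" part would mostly boil down to dumping that site's stored regexes back into the urlfilter file before kicking off the crawl; a rough sketch (plain file writing, not tied to any real Nutch API):

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.List;

public class FilterWriter {
    // Rewrite the crawl-urlfilter file from the regex lines stored for one site.
    // The trailing "-." rejects anything the site's own rules didn't accept.
    public static void writeFilters(String path, List<String> filters) throws IOException {
        PrintWriter out = new PrintWriter(new FileWriter(path));
        try {
            for (String line : filters) {
                out.println(line);   // lines are stored as "+regex" or "-regex"
            }
            out.println("-.");
        } finally {
            out.close();
        }
    }
}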
As I am new, maybe this wouldn't be useful and there are already processes that take care of all this? Or maybe I'm just not using Nutch correctly? If this would help anyone, I am thinking about writing it and I'll make it available to whoever wants it.
