Would this be useful to anyone? I am new to Nutch, but from what I've seen so far it would make my life easier.
I was thinking about building a small web interface that would run the crawler. This would be aimed more at intranet crawling than at whole-web crawling. It would basically keep a database of the URLs you want to crawl, along with the properties associated with each site (regex filters, depth, delays, etc.). It would then monitor the log file, keep track of the number of sites crawled, the number of errors, the number of successes and so on, and display it all on a nice status page.

I find that I'm adding and deleting regexes in the urlfilter files all the time depending on which sites I'm crawling. It would be nice to have all of that organized in a database, so that if I ever want to recrawl a site, I just hit two buttons, the right regexes and starting URL load up, and the crawl starts. A rough sketch of what I mean follows below. Since I'm new, maybe this wouldn't be useful and there are already processes that take care of all this? Or maybe I'm just not using Nutch correctly? If this would help anyone, I'm thinking about writing it, and I'll make it available to whoever wants it.
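For what it's worth, here is a very rough sketch of the per-site record I have in mind: just a plain Java class holding the seed URL, depth, delay, and filter regexes, plus a helper that writes the stored regexes back out in the +/- line format that conf/regex-urlfilter.txt uses. The class and field names are placeholders for the idea, not anything that exists in Nutch.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical per-site record: one row in the "sites to crawl" database.
public class CrawlSite {
    final String seedUrl;       // starting URL for the crawl
    final int depth;            // how many link levels deep to go
    final long delayMs;         // politeness delay between fetches
    final List<String> filters; // regex filter lines, e.g. "+^http://intranet\\.example\\.com/"

    public CrawlSite(String seedUrl, int depth, long delayMs, List<String> filters) {
        this.seedUrl = seedUrl;
        this.depth = depth;
        this.delayMs = delayMs;
        this.filters = filters;
    }

    // Write the stored filters in the same "+regex" / "-regex" line format
    // that Nutch's regex-urlfilter.txt uses, so a recrawl can just
    // regenerate the file from the database instead of editing it by hand.
    public void writeUrlFilter(Path urlFilterFile) throws IOException {
        Files.write(urlFilterFile, filters);
    }

    public static void main(String[] args) throws IOException {
        CrawlSite site = new CrawlSite(
                "http://intranet.example.com/",   // placeholder seed URL
                3,
                1000,
                List.of("+^http://intranet\\.example\\.com/",
                        "-\\.(gif|jpg|png|css|js)$"));
        site.writeUrlFilter(Path.of("regex-urlfilter.txt"));
        System.out.println("Wrote " + site.filters.size() + " filter lines for " + site.seedUrl);
    }
}

The idea would be that the web interface keeps these records in the database, and "recrawl" just regenerates the filter file and kicks off the crawl with the stored seed URL and settings.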
