Hi Ian,

I am very new to Nutch as well, so take this for what it is.

First, I have been putting all my regex filters into the crawl-urlfilter.txt file and just leaving them there, even when I am only crawling a subset of sites, and I haven't had any problems. Of course, if you have complicated regexes that might clash with each other, then you would have to add/remove them.
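
For example, my crawl-urlfilter.txt usually ends up looking something like the sketch below (example.com is just a stand-in for whatever site I happen to be crawling):

    # skip URLs with characters that usually mean CGI/session junk
    -[?*!@=]
    # accept anything under the site I care about (example.com is a placeholder)
    +^http://([a-z0-9]*\.)*example.com/
    # reject everything else
    -.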

Second, have you looked at archive.org's Heritrix (http://crawler.archive.org/)? It does a lot of what you are talking about (sans the database). I am not suggesting you use it, as the crawl output is probably not what you want, but the program does an amazing number of things that might give you some (more) ideas. It might be worth your time just to check it out.

-lucas

On May 19, 2005, at 6:27 AM, Ian Reardon wrote:

Would this be useful to anyone? I am new to Nutch, but from what I've
seen so far it would make my life easier.

I was thinking about having a small web interface that would run the
crawler. This would be more for intranet crawling rather than whole-web
crawling. It would basically have a database of the URLs you wanted to
crawl along with properties associated with each site (regexes, depth,
delays, etc.). It would then monitor the log file, keep track of the
number of sites crawled, # errors, # successes and whatnot, and display
it in a nice layout/status page.

I find that I'll be adding and deleting regexes from the urlfilter files
all the time based on the sites I'm crawling. It would be nice to have
all of that organized in a database, so that if I ever want to recrawl
a site, I just hit two buttons, the right regexes and starting URL load
up, and it just starts to crawl.

As I am new, maybe this wouldn't be useful and there are already
processes that take care of all this? Or maybe I'm just not using
Nutch correctly? If this would help anyone, I am thinking about
writing it, and I'll make it available to whoever wants it.


