I've been using a Perl-based focused web crawler with a MySQL back end, but am now looking at Nutch instead. It seems that a few other people have done something similar. I'm wondering whether we could pool our resources and work together on this?
It seems to me that we would be building a few extra plugins. Here is how I see a focused Nutch working:

1) Injecting new URLs works as before.
2) The initial generate works as before, but we might want to do something smarter with DMOZ or Wikipedia.
3) Fetch works as before, based on the initial URLs. We do not follow links, but we still store them as outlinks as usual.
4) We do a new indexing pass based on some new relevance algorithm (e.g. the page mentions items we are interested in) and mark each page as relevant or not.
5) Instead of doing an old-style generate or updatedb, we go through all the pages we marked as relevant and take their outlinks for the next iteration.
6) We also inject more URLs added by users, and potentially the contents of RSS feeds which we know are relevant to our topic.
7) We loop back to step 3 above.

Eventually we end up with a Lucene-style index as usual, which can be used with the Nutch web app, Solr, or some other code.

Who is interested in this, or has done it in the past, and can we chat about it?

Alex
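P.S. To make the loop in steps 3-5 and 7 concrete, here's a toy sketch of the control flow. This is just an illustration, not Nutch code: the in-memory PAGES link graph stands in for fetched segments, and is_relevant() is a placeholder for whatever relevance plugin we'd actually write. The key point is that only the outlinks of pages marked relevant go into the next iteration's fetch list.

```python
# Toy link graph: url -> (page text, outlinks). Stand-in for real fetching/parsing.
PAGES = {
    "http://a.example/": ("on our topic", ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("unrelated content", ["http://d.example/"]),
    "http://c.example/": ("more on our topic", ["http://e.example/"]),
    "http://d.example/": ("noise", []),
    "http://e.example/": ("topic again", []),
}

def is_relevant(text, keywords=("topic",)):
    """Placeholder relevance test: page mentions any keyword (step 4)."""
    return any(k in text for k in keywords)

def focused_crawl(seeds, max_iterations=10):
    seen, frontier, relevant = set(), list(seeds), []
    for _ in range(max_iterations):
        if not frontier:
            break
        next_frontier = []
        for url in frontier:
            if url in seen or url not in PAGES:
                continue
            seen.add(url)
            text, outlinks = PAGES[url]        # "fetch" (step 3)
            if is_relevant(text):              # mark relevant or not (step 4)
                relevant.append(url)
                next_frontier.extend(outlinks)  # only relevant pages' outlinks (step 5)
        frontier = next_frontier               # loop back (step 7)
    return relevant

print(focused_crawl(["http://a.example/"]))
# -> ['http://a.example/', 'http://c.example/', 'http://e.example/']
```

Note that b.example is fetched but judged irrelevant, so its outlink d.example is never crawled at all; that pruning is what makes the crawl "focused".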