I do something like this: I update the URL scores with my own algorithm, which works on the parse data. Works great.
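
A minimal sketch of that idea, assuming a simple keyword-overlap relevance measure. It does not use the Nutch scoring plugin API; the function names, the topic terms and the data layout are all hypothetical and only illustrate scoring outlinks from parse data.

```python
import re

# Hypothetical topic vocabulary; a real crawl would use a better relevance model.
TOPIC_TERMS = {"nutch", "crawler", "lucene"}


def relevance(text, topic_terms=TOPIC_TERMS):
    """Fraction of topic terms that appear in the parsed page text (0.0 to 1.0)."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(words & topic_terms) / len(topic_terms)


def update_scores(scores, parsed_pages):
    """Score parsed pages and pass each page's score on to its outlinks.

    parsed_pages maps url -> (parsed_text, list_of_outlinks).
    scores maps url -> float and is updated in place.
    """
    for url, (text, outlinks) in parsed_pages.items():
        page_score = relevance(text)
        scores[url] = page_score
        for link in outlinks:
            # A page we parsed ourselves keeps its own score.
            if link in parsed_pages:
                continue
            # An unfetched outlink takes the best score of any page linking to it.
            scores[link] = max(scores.get(link, 0.0), page_score)
    return scores


def next_fetch_list(scores, fetched, top_n=2):
    """Pick the highest-scoring URLs that have not been fetched yet."""
    candidates = [(score, url) for url, score in scores.items() if url not in fetched]
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for _, url in candidates[:top_n]]
```

Calling `update_scores({}, parsed_pages)` after each fetch round and then `next_fetch_list(scores, fetched)` gives the URLs to fetch next, so outlinks of on-topic pages are fetched before outlinks of off-topic ones.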

2009/7/31 Ken Krugler <[email protected]>

> Hi Alex,
>
> There has been discussion on focused web crawling using Nutch in the past,
> so you probably want to check the archives.
>
> Key aspect is using the scoring plugin API to rate pages (and outlinks from
> pages), which then can be used to do a more efficient job of fetching pages
> that are likely to be of interest, as they have more interesting pages
> pointing to them.
>
> -- Ken
>
>
> On Jul 31, 2009, at 3:07am, Alex McLintock wrote:
>
>> I've been using a perl based focussed web crawler with a MySQL back
>> end, but am now looking at Nutch instead. It seems like a few other
>> people have done something similar. I'm wondering whether we could
>> pool our resources and work together on this?
>>
>> It seems to me that we would be building a few extra plugins. Here is
>> how I see a focussed nutch working.
>>
>> 1) Injecting new URLS works as before
>> 2) initial generate works as before but we might want to do something
>> smarter with DMOZ or wikipedia.
>> 3) fetch works as before based upon the initial urls. We do not follow
>> links - but we still store them as outlinks as usual.
>> 4) we do a new index based upon some new relevance algorithm (eg page
>> mentions items that we are interested in) and mark pages as relevant
>> or not.
>> 5) instead of doing an old style generate or updatedb we go through
>> all the pages which we marked as relevant and take those outlinks for
>> our next iteration.
>> 6) We also inject more urls which are added by the users, and
>> potentially contents of rss files which we know are relevant to our
>> topic.
>> 7) we loop back to 3 above.
>>
>> Eventually we end up with a lucene style index as usual which can be
>> used with the nutch web app, or solr, or some other code
>>
>> Who is interested in this or has done it in the past.... and can we
>> chat about it?
>>
>> Alex
>>
>
> --------------------------
> Ken Krugler
> TransPac Software, Inc.
> <http://www.transpac.com>
> +1 530-210-6378
>

--
-MilleBii-
