I've been using a Perl-based focused web crawler with a MySQL back end, but am now looking at Nutch instead. It seems that a few other people have done something similar. I'm wondering whether we could pool our resources and work together on this?
It seems to me that we would be building a few extra plugins. Here is how I see a focused Nutch working:

1) Injecting new URLs works as before.
2) The initial generate works as before, but we might want to do something smarter with DMOZ or Wikipedia.
3) Fetch works as before, based on the initial URLs. We do not follow links, but we still store them as outlinks as usual.
4) We do a new indexing pass based on some new relevance algorithm (e.g. the page mentions items we are interested in) and mark each page as relevant or not.
5) Instead of doing an old-style generate or updatedb, we go through all the pages we marked as relevant and take their outlinks for the next iteration.
6) We also inject more URLs added by users, and potentially the contents of RSS feeds which we know are relevant to our topic.
7) We loop back to step 3 above.

Eventually we end up with a Lucene-style index as usual, which can be used with the Nutch web app, Solr, or some other code.

Who is interested in this, or has done it in the past, and can we chat about it?

Alex
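P.S. To make the loop in steps 3-5 and 7 concrete, here's a toy sketch of the control flow. This is just an illustration, not Nutch code: the in-memory PAGES link graph stands in for fetched segments, and is_relevant() is a placeholder for whatever relevance plugin we'd actually write. The key point is that only the outlinks of pages marked relevant go into the next iteration's fetch list.

```python
# Toy link graph: url -> (page text, outlinks). Stand-in for real fetching/parsing.
PAGES = {
    "http://a.example/": ("on our topic", ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("unrelated content", ["http://d.example/"]),
    "http://c.example/": ("more on our topic", ["http://e.example/"]),
    "http://d.example/": ("noise", []),
    "http://e.example/": ("topic again", []),
}

def is_relevant(text, keywords=("topic",)):
    """Placeholder relevance test: page mentions any keyword (step 4)."""
    return any(k in text for k in keywords)

def focused_crawl(seeds, max_iterations=10):
    seen, frontier, relevant = set(), list(seeds), []
    for _ in range(max_iterations):
        if not frontier:
            break
        next_frontier = []
        for url in frontier:
            if url in seen or url not in PAGES:
                continue
            seen.add(url)
            text, outlinks = PAGES[url]        # "fetch" (step 3)
            if is_relevant(text):              # mark relevant or not (step 4)
                relevant.append(url)
                next_frontier.extend(outlinks)  # only relevant pages' outlinks (step 5)
        frontier = next_frontier               # loop back (step 7)
    return relevant

print(focused_crawl(["http://a.example/"]))
# -> ['http://a.example/', 'http://c.example/', 'http://e.example/']
```

Note that b.example is fetched but judged irrelevant, so its outlink d.example is never crawled at all; that pruning is what makes the crawl "focused".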