Hi Alex,

There has been discussion of focused web crawling with Nutch in the past, so you probably want to check the archives.

The key aspect is using the scoring plugin API to rate pages (and the outlinks from those pages). Those scores can then be used to do a more efficient job of fetching pages that are likely to be of interest, because they have more interesting pages pointing to them.
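To make that a bit more concrete, here is a minimal, self-contained sketch of that kind of topic scoring. It is not the Nutch API itself: the class name, the keyword list and the method names are all made up for illustration, and in a real plugin this logic would be wired into the ScoringFilter extension point (exact method signatures vary between Nutch versions).

import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Toy topic scorer (illustrative only, not a Nutch class). Rates a page by
// how many topic keywords appear in its parsed text, and spreads that score
// over the page's outlinks.
public class TopicScorer {

    // Hypothetical topic vocabulary; a real plugin would read this from config.
    private final List<String> keywords = Arrays.asList("hadoop", "crawler", "lucene");

    // Fraction of topic keywords found in the page text, in [0, 1].
    public float pageScore(String parsedText) {
        String text = parsedText.toLowerCase(Locale.ROOT);
        long hits = keywords.stream().filter(text::contains).count();
        return (float) hits / keywords.size();
    }

    // Score passed on to each outlink: the parent's relevance spread across links.
    public float outlinkScore(float parentScore, int outlinkCount) {
        return outlinkCount == 0 ? 0f : parentScore / outlinkCount;
    }

    public static void main(String[] args) {
        TopicScorer scorer = new TopicScorer();
        float score = scorer.pageScore("An intro to the Lucene index format and Hadoop jobs");
        System.out.println("page score = " + score);
        System.out.println("per-outlink = " + scorer.outlinkScore(score, 10));
    }
}

Pages that contain more of the topic vocabulary get a higher score, and that score is spread over their outlinks, so links from relevant pages rise to the top of the fetch queue.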

-- Ken


On Jul 31, 2009, at 3:07am, Alex McLintock wrote:

I've been using a Perl-based focused web crawler with a MySQL back
end, but am now looking at Nutch instead. It seems like a few other
people have done something similar. I'm wondering whether we could
pool our resources and work together on this?

It seems to me that we would be building a few extra plugins. Here is
how I see a focused Nutch working.

1) Injecting new URLs works as before.
2) The initial generate works as before, but we might want to do
something smarter with DMOZ or Wikipedia.
3) Fetch works as before, based on the initial URLs. We do not follow
links, but we still store them as outlinks as usual.
4) We run a new indexing pass based on some relevance algorithm (e.g.
the page mentions items we are interested in) and mark pages as
relevant or not.
5) Instead of doing an old-style generate or updatedb, we go through
all the pages we marked as relevant and take their outlinks for our
next iteration (see the sketch after this list).
6) We also inject more URLs added by the users, and potentially the
contents of RSS feeds which we know are relevant to our topic.
7) We loop back to step 3 above.
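
As a very rough illustration of steps 4 and 5, the loop might look
something like the following. All names here are placeholders rather
than real Nutch classes; in Nutch this would be driven by the crawl db
plus a scoring/indexing plugin, but the shape of the logic is the same:
pages that pass some topic test contribute their outlinks to the next
fetch round, and everything else is dropped.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Rough sketch of steps 4 and 5 (placeholder names, not real Nutch classes):
// mark fetched pages as relevant or not, then take only the outlinks of the
// relevant pages as the fetch list for the next iteration.
public class FocusedLoop {

    static class Page {
        final String url;
        final String text;          // parsed page text
        final List<String> outlinks;
        Page(String url, String text, List<String> outlinks) {
            this.url = url;
            this.text = text;
            this.outlinks = outlinks;
        }
    }

    // Placeholder relevance test (step 4), e.g. "page mentions a topic keyword".
    static boolean isRelevant(Page page, List<String> keywords) {
        String text = page.text.toLowerCase(Locale.ROOT);
        return keywords.stream().anyMatch(text::contains);
    }

    // Step 5: outlinks of relevant pages become the next iteration's fetch list.
    static List<String> nextFetchList(List<Page> fetched, List<String> keywords) {
        List<String> next = new ArrayList<>();
        for (Page page : fetched) {
            if (isRelevant(page, keywords)) {
                next.addAll(page.outlinks);
            }
        }
        return next;
    }

    public static void main(String[] args) {
        List<String> keywords = Arrays.asList("crawler", "nutch");
        List<Page> fetched = Arrays.asList(
            new Page("http://example.com/a", "a page about the Nutch crawler",
                     Arrays.asList("http://example.com/b")),
            new Page("http://example.com/c", "nothing on topic here",
                     Arrays.asList("http://example.com/d")));
        System.out.println(nextFetchList(fetched, keywords)); // [http://example.com/b]
    }
}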

Eventually we end up with a Lucene-style index as usual, which can be
used with the Nutch web app, Solr, or some other code.

Who is interested in this or has done it in the past... and can we
chat about it?

Alex

--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378
