I do something like this: I update the URL scores with my own
algorithm, which works on the parse data.
It works great.
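
A minimal sketch of that kind of scoring, in plain Python rather than
Nutch's Java scoring plugin API. The term list, the damping factor, and
all function names are hypothetical, chosen only for illustration; this
is not the algorithm referred to above. It scores each page from its
parsed text, passes a damped share of that score to the page's outlinks,
and picks the highest-scoring unfetched URLs for the next round.

```python
import heapq
import re

# Hypothetical topic vocabulary (an assumption, not from this thread).
TOPIC_TERMS = {"nutch", "crawler", "lucene"}


def relevance(text):
    """Return the fraction of tokens in the parsed text that are topic terms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in TOPIC_TERMS)
    return hits / len(tokens)


def update_scores(scores, url, text, outlinks, damping=0.5):
    """Score a fetched page and add a damped, equal share to each outlink."""
    page_score = relevance(text)
    scores[url] = page_score
    if outlinks:
        share = damping * page_score / len(outlinks)
        for link in outlinks:
            scores[link] = scores.get(link, 0.0) + share
    return page_score


def top_unfetched(scores, fetched, n):
    """Return up to n not-yet-fetched URLs, highest score first."""
    candidates = [(score, url) for url, score in scores.items()
                  if url not in fetched]
    return [url for _, url in heapq.nlargest(n, candidates)]
```

Outlinks of an off-topic page get a share of zero, so they sink to the
bottom of the fetch list instead of being dropped outright.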

2009/7/31 Ken Krugler <[email protected]>

> Hi Alex,
>
> There has been discussion on focused web crawling using Nutch in the past,
> so you probably want to check the archives.
>
> The key aspect is using the scoring plugin API to rate pages (and outlinks
> from pages). Those scores can then be used to fetch more efficiently, by
> prioritizing pages that are likely to be of interest because more
> interesting pages point to them.
>
> -- Ken
>
>
>
> On Jul 31, 2009, at 3:07am, Alex McLintock wrote:
>
>> I've been using a Perl-based focussed web crawler with a MySQL back
>> end, but am now looking at Nutch instead. It seems like a few other
>> people have done something similar. I'm wondering whether we could
>> pool our resources and work together on this?
>>
>> It seems to me that we would be building a few extra plugins. Here is
>> how I see a focussed Nutch working.
>>
>> 1) Injecting new URLs works as before
>> 2) initial generate works as before but we might want to do something
>> smarter with DMOZ or Wikipedia.
>> 3) fetch works as before based upon the initial urls. We do not follow
>> links - but we still store them as outlinks as usual.
>> 4) we do a new index based upon some new relevance algorithm (e.g. page
>> mentions items that we are interested in) and mark pages as relevant
>> or not.
>> 5) instead of doing an old style generate or updatedb we go through
>> all the pages which we marked as relevant and take those outlinks for
>> our next iteration.
>> 6) We also inject more URLs which are added by the users, and
>> potentially contents of RSS files which we know are relevant to our
>> topic.
>> 7) we loop back to 3 above.
>>
>> Eventually we end up with a Lucene-style index as usual which can be
>> used with the Nutch web app, or Solr, or some other code.
>>
>> Who is interested in this, or has done it in the past, and can we
>> chat about it?
>>
>> Alex
>>
>
> --------------------------
> Ken Krugler
> TransPac Software, Inc.
> <http://www.transpac.com>
> +1 530-210-6378
>
>


-- 
-MilleBii-
