On Thu, 01 Sep 2005 09:36:19 -0700, Doug Cutting wrote:
> It would be worth considering which features of your constrained
> crawler could be cast as improvements to Nutch's existing tools
> (e.g., more seed url formats, more output formats, http 1.1, custom
> scopes, etc.) and which require a different control flow (online
> fetchlist building?). In some cases (e.g., fetch prioritization)
> perhaps a new Plugin should be added to Nutch.

In most cases, it is merely a generalization of what Nutch already has,
introducing interfaces where appropriate to make behavior easier to modify.
I've come to see the importance of making scoring pluggable (essential for
focused crawling) and of supporting both host-based (current nutch-84) and
score-based (current Nutch) fetch prioritization.
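
To make the idea of pluggable scoring and prioritization concrete, here is a rough sketch of what such interfaces might look like. All names (Candidate, ScoringPolicy, FetchPrioritizer, and the implementations) are hypothetical and are not existing Nutch classes. The round-robin reading of "host-based" is only an assumption for illustration, not a description of the nutch-84 patch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical fetch candidate; not a Nutch class. */
final class Candidate {
    final String url;
    final String host;
    final double score;

    Candidate(String url, String host, double score) {
        this.url = url;
        this.host = host;
        this.score = score;
    }
}

/** Pluggable scoring: a focused crawler would supply its own policy. */
interface ScoringPolicy {
    double score(String url, String anchorText);
}

/** Toy policy (assumption): favor links whose anchor text mentions a topic. */
final class KeywordScoring implements ScoringPolicy {
    private final String keyword;

    KeywordScoring(String keyword) {
        this.keyword = keyword.toLowerCase();
    }

    public double score(String url, String anchorText) {
        return anchorText.toLowerCase().contains(keyword) ? 1.0 : 0.1;
    }
}

/** Pluggable ordering of fetch candidates. */
interface FetchPrioritizer {
    List<Candidate> order(List<Candidate> candidates);
}

/** Score-based: highest score first. */
final class ScoreBasedPrioritizer implements FetchPrioritizer {
    public List<Candidate> order(List<Candidate> candidates) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble((Candidate c) -> c.score).reversed());
        return sorted;
    }
}

/** Host-based (assumed here to mean round-robin across hosts). */
final class HostBasedPrioritizer implements FetchPrioritizer {
    public List<Candidate> order(List<Candidate> candidates) {
        // Group by host, preserving first-seen host order.
        Map<String, Deque<Candidate>> byHost = new LinkedHashMap<>();
        for (Candidate c : candidates) {
            byHost.computeIfAbsent(c.host, h -> new ArrayDeque<>()).add(c);
        }
        // Take one candidate per host per pass until all are emitted.
        List<Candidate> out = new ArrayList<>(candidates.size());
        while (out.size() < candidates.size()) {
            for (Deque<Candidate> queue : byHost.values()) {
                Candidate next = queue.poll();
                if (next != null) {
                    out.add(next);
                }
            }
        }
        return out;
    }
}
```

With interfaces along these lines, a whole-web crawl and a focused crawl would differ only in which ScoringPolicy and FetchPrioritizer they are configured with.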

There are some departures which need to be reconciled, in particular the role 
of fetchlists and the way they are built. However, I do not see any major 
incompatibilities between whole-web and focused crawling requirements.

In some cases, though, focused crawling may require extra data to be stored
that is not useful for whole-web crawling: for example, a URL's parent, its
seed URL, and its depth (essential for crawl scopes).
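
As an illustration of why that extra data matters, here is a minimal sketch of a per-URL record carrying parent, seed, and depth, plus one scope that uses it. CrawlEntry, CrawlScope, and SameHostDepthScope are hypothetical names, not Nutch classes, and the "same host as the seed, within a maximum depth" rule is just one example of a scope.

```java
import java.net.URI;

/** Hypothetical per-URL record with the extra focused-crawl fields. */
final class CrawlEntry {
    final String url;
    final String parentUrl; // null for a seed
    final String seedUrl;   // the seed this URL was reached from
    final int depth;        // number of links followed from the seed

    private CrawlEntry(String url, String parentUrl, String seedUrl, int depth) {
        this.url = url;
        this.parentUrl = parentUrl;
        this.seedUrl = seedUrl;
        this.depth = depth;
    }

    static CrawlEntry seed(String url) {
        return new CrawlEntry(url, null, url, 0);
    }

    /** An outlink inherits the seed and sits one level deeper. */
    CrawlEntry child(String childUrl) {
        return new CrawlEntry(childUrl, this.url, this.seedUrl, this.depth + 1);
    }
}

/** Pluggable crawl scope: decides whether a URL may be fetched. */
interface CrawlScope {
    boolean accepts(CrawlEntry entry);
}

/** Example scope (assumption): stay on the seed's host, up to maxDepth. */
final class SameHostDepthScope implements CrawlScope {
    private final int maxDepth;

    SameHostDepthScope(int maxDepth) {
        this.maxDepth = maxDepth;
    }

    public boolean accepts(CrawlEntry entry) {
        if (entry.depth > maxDepth) {
            return false;
        }
        // URI.create throws IllegalArgumentException on malformed input.
        String host = URI.create(entry.url).getHost();
        String seedHost = URI.create(entry.seedUrl).getHost();
        return host != null && host.equalsIgnoreCase(seedHost);
    }
}
```

Neither check is possible without the stored seed URL and depth, which is why a scope of this kind needs them even though a whole-web crawl would never read those fields.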

k
