On Thu, 01 Sep 2005 09:36:19 -0700, Doug Cutting wrote:
> It would be worth considering which features of your constrained
> crawler could be cast as improvements to Nutch's existing tools
> (e.g., more seed url formats, more output formats, http 1.1, custom
> scopes, etc.) and which require a different control flow (online
> fetchlist building?). In some cases (e.g., fetch prioritization)
> perhaps a new Plugin should be added to Nutch.

In most cases, it is merely a generalization of what Nutch already has,
introducing interfaces where appropriate to make behavior easier to modify.
I've come to see the importance of making scoring pluggable (essential for
focused crawling), and of supporting both host-based (current nutch-84) and
score-based (current Nutch) fetch prioritization.
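
As a rough illustration of what a pluggable prioritization interface could
look like, here is a minimal sketch. All names (Candidate, FetchPrioritizer,
ScorePrioritizer, HostPrioritizer) are hypothetical and are not existing
Nutch classes. The round-robin reading of "host-based" is also an assumption
made for the example.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical fetch candidate; not an actual Nutch type. */
final class Candidate {
    final String url;
    final float score;

    Candidate(String url, float score) {
        this.url = url;
        this.score = score;
    }
}

/** Hypothetical plug point: decides the order in which candidates are fetched. */
interface FetchPrioritizer {
    List<Candidate> order(List<Candidate> candidates);
}

/** Score-based ordering: highest score first (stable for equal scores). */
final class ScorePrioritizer implements FetchPrioritizer {
    public List<Candidate> order(List<Candidate> candidates) {
        List<Candidate> out = new ArrayList<>(candidates);
        out.sort(Comparator.comparingDouble((Candidate c) -> c.score).reversed());
        return out;
    }
}

/** Host-based ordering (assumed here to mean round-robin across hosts). */
final class HostPrioritizer implements FetchPrioritizer {
    public List<Candidate> order(List<Candidate> candidates) {
        // Group candidates by host, preserving first-seen host order.
        Map<String, List<Candidate>> byHost = new LinkedHashMap<>();
        for (Candidate c : candidates) {
            String host = URI.create(c.url).getHost();
            byHost.computeIfAbsent(host, h -> new ArrayList<>()).add(c);
        }
        // Take the i-th candidate of every host in turn until all are used.
        List<Candidate> out = new ArrayList<>();
        boolean added = true;
        for (int i = 0; added; i++) {
            added = false;
            for (List<Candidate> queue : byHost.values()) {
                if (i < queue.size()) {
                    out.add(queue.get(i));
                    added = true;
                }
            }
        }
        return out;
    }
}
```

With an interface like this, a whole-web crawl and a focused crawl could
share the same fetcher and differ only in which implementation is plugged in.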

There are some departures which need to be reconciled, in particular the role 
of fetchlists and the way they are built. However, I do not see any major 
incompatibilities between whole-web and focused crawling requirements.

In some cases, though, focused crawling may require storing extra data that
is not useful for whole-web crawling: for example, a URL's parent, its seed
URL, and its depth (essential for crawl scopes).
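
To make the extra per-URL data concrete, the following is a minimal sketch
of a record carrying parent, seed, and depth, together with one possible
scope check that uses them. CrawlEntry, CrawlScope, and SeedHostDepthScope
are hypothetical names invented for this example, and the particular rule
(same host as the seed, bounded depth) is only one assumed kind of scope.

```java
import java.net.URI;

/** Hypothetical per-URL record with the extra fields a crawl scope needs. */
final class CrawlEntry {
    final String url;
    final String parentUrl; // null for a seed
    final String seedUrl;   // the seed this URL was ultimately reached from
    final int depth;        // 0 for a seed, parent's depth + 1 otherwise

    private CrawlEntry(String url, String parentUrl, String seedUrl, int depth) {
        this.url = url;
        this.parentUrl = parentUrl;
        this.seedUrl = seedUrl;
        this.depth = depth;
    }

    static CrawlEntry seed(String url) {
        return new CrawlEntry(url, null, url, 0);
    }

    /** A discovered outlink inherits the seed and sits one level deeper. */
    CrawlEntry child(String childUrl) {
        return new CrawlEntry(childUrl, url, seedUrl, depth + 1);
    }
}

/** Hypothetical plug point: decides whether an entry is inside the crawl. */
interface CrawlScope {
    boolean accepts(CrawlEntry entry);
}

/** Example scope: stay on the seed's host and within a maximum depth. */
final class SeedHostDepthScope implements CrawlScope {
    private final int maxDepth;

    SeedHostDepthScope(int maxDepth) {
        this.maxDepth = maxDepth;
    }

    public boolean accepts(CrawlEntry entry) {
        if (entry.depth > maxDepth) {
            return false;
        }
        String host = URI.create(entry.url).getHost();
        String seedHost = URI.create(entry.seedUrl).getHost();
        return host != null && host.equalsIgnoreCase(seedHost);
    }
}
```

Neither check could be made from the URL alone, which is why the seed and
depth have to be stored alongside each entry.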

k



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers