Hi,
I think it is an interesting idea, but from a technical perspective the decision to use HiveMind or Spring should, in my opinion, be made for the whole project. The same goes for JDK 5.0. So right now it is not the best match for Nutch.

On the functionality side I am not the best person to judge, as I am doing rather big crawls with many hosts, but it sounds interesting.

Regards,
Piotr



Erik Hatcher wrote:
Kelvin,

Big +1!!! I'm working on focused crawling as well, and your work fits well with my needs.

An implementation detail - have you considered using HiveMind rather than Spring? This would be much more compatible license-wise with Nutch and be easier to integrate into the ASF repository. Further - I wonder if the existing plugin mechanism would work well as a HiveMind-based system too.

    Erik

On Aug 23, 2005, at 12:02 AM, Kelvin Tan wrote:

I've been working on some changes to crawling to facilitate its use as a non-whole-web crawler, and would like to gauge interest on this list in including it somewhere in the Nutch repo, hopefully before the map-red branch gets merged in.

It is basically a partial re-write of the whole fetching mechanism, borrowing large chunks of code here and there.

Features include:
- Customizable seed inputs, i.e. seed a crawl from a file, a database, a Nutch FetchList, etc.
- Customizable crawl scopes, e.g. crawl the seed URLs and only the URLs within their domains (this can already be accomplished manually with RegexURLFilter, but what if there are 200,000 seed URLs?), or crawl the seed URL domains + 1 external link (not possible with the current filter mechanism). See the sketch after this list.
- Online fetchlist building (as opposed to Nutch's offline method), with customizable strategies for building a fetchlist. The default implementation gives priority to hosts with a larger number of pages to crawl. Offline fetchlist building works too.
- Runs continuously until all links are crawled.
- Customizable fetch output mechanisms: output to a file, to the WebDB, or not at all (if we're just implementing a link checker, for example).
- Fully utilizes HTTP 1.1 connection persistence and request pipelining.
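
To make the customizable crawl scopes point concrete, here is a minimal sketch of what such an extension point might look like. The names (CrawlScope, SeedDomainScope) are illustrative only, not the actual classes in the code:

import java.net.MalformedURLException;
import java.net.URL;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Hypothetical extension point: decides whether a discovered URL gets crawled.
interface CrawlScope {
    boolean isInScope(String url);
}

// Example scope: stay within the domains of the seed URLs.
// Membership is a single hash lookup, so 200,000 seeds are no problem.
class SeedDomainScope implements CrawlScope {
    private final Set<String> seedDomains = new HashSet<String>();

    SeedDomainScope(Collection<String> seedUrls) throws MalformedURLException {
        for (String seed : seedUrls) {              // JDK 5 for-each loop
            seedDomains.add(new URL(seed).getHost());
        }
    }

    public boolean isInScope(String url) {
        try {
            return seedDomains.contains(new URL(url).getHost());
        } catch (MalformedURLException e) {
            return false;                           // unparseable URLs are out of scope
        }
    }
}

A scope like this replaces a pass through 200,000 regexes with a constant-time set lookup, and a "seed domains + 1 external link" scope is just another implementation of the same interface.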

It is fully compatible with Nutch as it is, i.e. given a Nutch fetchlist, the new crawler can produce a Nutch segment. However, if you don’t need that at all, and are just interested in Nutch as a crawler, then that’s ok too!

It is a drop-in replacement for the Nutch crawler, and compiles against the recently released 0.7 jar.

Some disclaimers:
It was never designed to be a superset replacement for the Nutch crawler. Rather, it is tailored to the fairly specific requirements of what I believe is called constrained crawling. It uses the Spring Framework (for easy customization of implementation classes) and JDK 5 features (the new loop syntax, autoboxing, generics, etc.). These two points sped up development, but probably make it an untasty Nutch acquisition... ;-) It shouldn't be tough to do something about that, though.

One of the areas where the Nutch crawler could use improvement is that it is really difficult to extend and customize. With the addition of interfaces and beans, developers can implement their own mechanism for fetchlist prioritization, or use a B-Tree as the backing implementation of the database of crawled URLs. I'm using Spring to make it easy to swap implementations and to keep the coupling loose.
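
As a rough illustration of the kind of interface I mean (again, hypothetical names rather than the actual code):

import java.util.Map;

// Hypothetical extension point for online fetchlist building.
interface FetchListPrioritizer {
    // Pick the host whose queued pages should be fetched next.
    String nextHost(Map<String, Integer> queuedPagesPerHost);
}

// A strategy in the spirit of the default: prefer the host with the most pages queued.
class LargestHostFirst implements FetchListPrioritizer {
    public String nextHost(Map<String, Integer> queuedPagesPerHost) {
        String best = null;
        int max = -1;
        for (Map.Entry<String, Integer> entry : queuedPagesPerHost.entrySet()) {
            if (entry.getValue() > max) {           // JDK 5 auto-unboxing of Integer
                max = entry.getValue();
                best = entry.getKey();
            }
        }
        return best;
    }
}

With Spring, swapping in your own prioritizer is then a one-line change to a bean definition rather than a patch against the fetcher source.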

There are some places where existing Nutch functionality is duplicated in some way to allow for slight modifications, as opposed to patching the Nutch classes. The rationale behind this approach was to simplify integration: it is much easier to have Our Crawler as a separate jar which depends on the Nutch jar. Furthermore, if it doesn't get accepted into Nutch, no rewriting or patching of Nutch sources needs to be done.

It's my belief that if you're using Nutch for anything but whole-web crawling and need to make even small changes to the way the crawling is performed, you'll find Our Crawler helpful.

I consider the current code beta quality. I've run it on smallish crawls (200k+ URLs) and things seem to be working OK, but it is nowhere near production quality.

Some related blog entries:

Improving Nutch for constrained crawls
http://www.supermind.org/index.php?p=274

Reflections on modifying the Nutch crawler
http://www.supermind.org/index.php?p=283

Limitations of OC
http://www.supermind.org/index.php?p=284

Even if we decide not to include it in the Nutch repo, the code will still be released under the APL. I'm in the process of adding a bit more documentation and a shell script for running it, and will release the files over the next couple of days.

Cheers,
Kelvin

http://www.supermind.org






