From what I heard from Kelvin, the Spring part could be thrown out and replaced with classes with main().
I think there is a need for having the Fetcher component more separated from the rest of Nutch. The Fetcher alone is well done and quite powerful on its own - it has host-based queues, doesn't use much RAM/CPU, it's polite, and so on. For instance, for Simpy.com I'm currently using only the Fetcher (+ the segment data it creates). I feed it URLs to fetch my own way, and I never use 'bin/nutch' to run all those other tools that work on the WebDB. Like Stephan, I thought the map-reduce implementation was going to be more complex to run.

Otis

--- Piotr Kosiorowski <[EMAIL PROTECTED]> wrote:

> Hi,
> I think it is an interesting idea, but from a technical perspective the decision to use HiveMind or Spring should be taken for the whole project, in my opinion. The same goes for JDK 5.0. So right now it is not the best match for Nutch.
>
> On the functionality side I am not the best person to judge it, as I am doing rather big crawls with many hosts, but it sounds interesting.
>
> Regards,
> Piotr
>
> Erik Hatcher wrote:
> > Kelvin,
> >
> > Big +1!!! I'm working on focused crawling as well, and your work fits well with my needs.
> >
> > An implementation detail - have you considered using HiveMind rather than Spring? This would be much more compatible license-wise with Nutch and be easier to integrate into the ASF repository. Further - I wonder if the existing plugin mechanism would work well as a HiveMind-based system too.
> >
> > Erik
> >
> > On Aug 23, 2005, at 12:02 AM, Kelvin Tan wrote:
> >
> >> I've been working on some changes to crawling to facilitate its use as a non-whole-web crawler, and would like to gauge interest on this list about including it somewhere in the Nutch repo, hopefully before the map-red branch gets merged in.
> >>
> >> It is basically a partial re-write of the whole fetching mechanism, borrowing large chunks of code here and there.
> >>
> >> Features include:
> >> - Customizable seed inputs, i.e. seed a crawl from a file, database, Nutch FetchList, etc.
> >> - Customizable crawl scopes, e.g. crawl the seed URLs and only the URLs within their domains (this can already be manually accomplished with RegexURLFilter, but what if there are 200,000 seed URLs?), or crawl seed URL domains + 1 external link (not possible with the current filter mechanism).
> >> - Online fetchlist building (as opposed to Nutch's offline method), with customizable strategies for building a fetchlist. The default implementation gives priority to hosts with a larger number of pages to crawl. Note that offline fetchlist building is ok too.
> >> - Runs continuously until all links are crawled.
> >> - Customizable fetch output mechanisms, like output to a file, to the WebDB, or even not at all (if we're just implementing a link-checker, for example).
> >> - Fully utilizes HTTP 1.1 connection persistence and request pipelining.
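To make those customization points concrete: below is a rough sketch, in the spirit of what Kelvin describes, of the kind of extension points such a crawler could expose. The interface names (SeedSource, CrawlScope, FetchListBuilder, FetchOutput) are illustrative assumptions only - they are not taken from Kelvin's code or from Nutch.

    // Hypothetical extension points for a constrained crawler. Names are
    // illustrative only; they are not Kelvin's classes or Nutch interfaces.
    import java.util.Iterator;

    /** Supplies seed URLs, e.g. from a file, a database, or a Nutch fetchlist. */
    interface SeedSource {
        Iterator<String> seedUrls() throws Exception;
    }

    /** Decides whether a discovered link falls inside the crawl scope. */
    interface CrawlScope {
        boolean inScope(String fromUrl, String toUrl);
    }

    /** Hands out the next URL to fetch; implementations choose the ordering. */
    interface FetchListBuilder {
        void add(String url);
        String next();   // null when there is nothing left to crawl
    }

    /** Receives fetched content: write a segment, update the WebDB, or discard. */
    interface FetchOutput {
        void store(String url, byte[] content) throws Exception;
    }

Swapping any one of these (a different seed source, a different output) would then just be a matter of providing another implementation, whether it is wired up through Spring, HiveMind, or a plain main() as Otis suggests.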
> >> It is fully compatible with Nutch as it is, i.e. given a Nutch fetchlist, the new crawler can produce a Nutch segment. However, if you don't need that at all, and are just interested in Nutch as a crawler, then that's ok too!
> >>
> >> It is a drop-in replacement for the Nutch crawler, and compiles with the recently released 0.7 jar.
> >>
> >> Some disclaimers:
> >> It was never designed to be a superset replacement for the Nutch crawler. Rather, it is tailored to the fairly specific requirements of what I believe is called constrained crawling. It uses the Spring Framework (for easy customization of implementation classes) and JDK 5 features (occasional new loop syntax, autoboxing, generics, etc.). These 2 points sped up dev. but probably make it an untasty Nutch acquisition.. ;-) But it shouldn't be tough to do something about that..
> >>
> >> One of the areas where the Nutch crawler could use improvement is that it's really difficult to extend and customize. With the addition of interfaces and beans, it's possible for developers to develop their own mechanism for fetchlist prioritization, or to use a B-Tree as the backing implementation of the database of crawled URLs. I'm using Spring to make it easy to change implementations and to keep the coupling loose..
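As an illustration of the "fetchlist prioritization" mechanism mentioned above - the default strategy that favors hosts with a larger number of pages to crawl - here is a minimal sketch. The class name and details are assumptions for illustration, not Kelvin's implementation; per-host politeness delays and duplicate detection are omitted.

    // A rough sketch of an "online" fetchlist that always hands out a URL from
    // the host with the largest backlog of queued pages. Illustrative only.
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.util.HashMap;
    import java.util.LinkedList;
    import java.util.Map;

    class HostPriorityFetchList {
        private final Map<String, LinkedList<String>> queues =
                new HashMap<String, LinkedList<String>>();

        /** Queue a URL under its host as soon as it is discovered. */
        public synchronized void add(String url) throws MalformedURLException {
            String host = new URL(url).getHost();
            LinkedList<String> q = queues.get(host);
            if (q == null) {
                q = new LinkedList<String>();
                queues.put(host, q);
            }
            q.add(url);
        }

        /** Return a URL from the host with the most queued pages, or null if done. */
        public synchronized String next() {
            String busiest = null;
            for (String host : queues.keySet()) {
                if (busiest == null
                        || queues.get(host).size() > queues.get(busiest).size()) {
                    busiest = host;
                }
            }
            if (busiest == null) return null;
            LinkedList<String> q = queues.get(busiest);
            String url = q.removeFirst();
            if (q.isEmpty()) queues.remove(busiest);
            return url;
        }
    }

A production version would also have to honor per-host crawl delays, so that the host with the largest backlog is not hit with back-to-back requests.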
> >> There are some places where existing Nutch functionality is duplicated in some way to allow for slight modifications, as opposed to patching the Nutch classes. The rationale behind this approach was to simplify integration - it is much easier to have Our Crawler as a separate jar which depends on the Nutch jar. Furthermore, if it doesn't get accepted into Nutch, no rewriting or patching of Nutch sources needs to be done.
> >>
> >> It's my belief that if you're using Nutch for anything but whole-web crawling and need to make even small changes to the way the crawling is performed, you'll find Our Crawler helpful.
> >>
> >> I consider the current code beta quality. I've run it on smallish crawls (200k+ URLs) and things seem to be working ok, but it is nowhere near production quality.
> >>
> >> Some related blog entries:
> >>
> >> Improving Nutch for constrained crawls
> >> http://www.supermind.org/index.php?p=274
> >>
> >> Reflections on modifying the Nutch crawler
> >> http://www.supermind.org/index.php?p=283
> >>
> >> Limitations of OC
> >> http://www.supermind.org/index.php?p=284
> >>
> >> Even if we decide not to include it in the Nutch repo, the code will still be released under the APL. I'm in the process of adding a bit more documentation and a shell script for running it, and will release the files over the next couple of days.
> >>
> >> Cheers,
> >> Kelvin
> >>
> >> http://www.supermind.org
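Returning to the "customizable crawl scopes" item in Kelvin's feature list: the point about 200,000 seed URLs is that a scope can be computed from the seeds rather than spelled out rule by rule. Below is a possible sketch, assuming the scope is simply "stay on the seed hosts"; the class name is illustrative and is not part of Nutch or of OC.

    // Keep the seed hosts in a set and test out-links against it, instead of
    // generating one RegexURLFilter rule per seed. Illustrative sketch only.
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.util.HashSet;
    import java.util.Set;

    class SeedDomainScope {
        private final Set<String> seedHosts = new HashSet<String>();

        /** Register the host of a seed URL. */
        public void addSeed(String url) throws MalformedURLException {
            seedHosts.add(new URL(url).getHost().toLowerCase());
        }

        /** A link is in scope only if its host matches one of the seed hosts. */
        public boolean inScope(String url) {
            try {
                return seedHosts.contains(new URL(url).getHost().toLowerCase());
            } catch (MalformedURLException e) {
                return false;   // unparseable links are treated as out of scope
            }
        }
    }

With 200,000 seeds this is one hash lookup per out-link, whereas the equivalent RegexURLFilter configuration would have to evaluate a very long list of patterns for every link.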