From what I heard from Kelvin, the Spring part could be thrown out and
replaced with classes with main().

I think there is a need to separate the Fetcher component more cleanly
from the rest of Nutch.  The Fetcher alone is well done and quite
powerful on its own - it has host-based queues, doesn't use much
RAM/CPU, it's polite, and so on.  For instance, for Simpy.com I'm
currently using only the Fetcher (+ segment data it creates).  I feed
it URLs to fetch my own way, and I never use 'bin/nutch' to run all
those other tools that work on WebDB.
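As an aside for readers following along: the host-based queueing that makes the Fetcher polite boils down to a per-host URL queue plus a per-host "next allowed fetch time". A minimal, illustrative sketch of that idea (my own toy code, not Nutch's actual Fetcher):

```java
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;
import java.util.Queue;

// Toy sketch of host-based polite queueing: URLs are bucketed per host,
// and each host may only be polled again after a fixed crawl delay.
class HostQueues {
    private final Map<String, Queue<String>> queues = new HashMap<String, Queue<String>>();
    private final Map<String, Long> nextFetchTime = new HashMap<String, Long>();
    private final long crawlDelayMs;

    HostQueues(long crawlDelayMs) {
        this.crawlDelayMs = crawlDelayMs;
    }

    void add(String host, String url) {
        Queue<String> q = queues.get(host);
        if (q == null) {
            q = new LinkedList<String>();
            queues.put(host, q);
        }
        q.add(url);
    }

    // Returns a URL from some host whose politeness window has elapsed,
    // or null if every host with pending URLs must still wait.
    String poll(long nowMs) {
        for (Map.Entry<String, Queue<String>> e : queues.entrySet()) {
            Long next = nextFetchTime.get(e.getKey());
            boolean allowed = (next == null || nowMs >= next.longValue());
            if (allowed && !e.getValue().isEmpty()) {
                nextFetchTime.put(e.getKey(), Long.valueOf(nowMs + crawlDelayMs));
                return e.getValue().poll();
            }
        }
        return null;
    }
}
```

A real fetcher would add per-host thread limits and take the delay from robots.txt, but the queue-per-host structure is the core of why one slow or heavily represented host can't starve the rest.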

Like Stephan, I thought the map-reduce implementation was going to be
more complex to run.

Otis


--- Piotr Kosiorowski <[EMAIL PROTECTED]> wrote:

> Hi,
> I think it is an interesting idea, but from a technical perspective the
> decision to use HiveMind or Spring should be taken for the whole
> project, in my opinion. The same goes for JDK 5.0. So right now it is
> not the best match for Nutch.
> 
> On the functionality side I am not the best person to judge it, as I am
> doing rather big crawls with many hosts, but it sounds interesting.
> 
> Regards,
> Piotr
> 
> 
> 
> Erik Hatcher wrote:
> > Kelvin,
> > 
> > Big +1!!!  I'm working on focused crawling as well, and your work fits
> > well with my needs.
> > 
> > An implementation detail - have you considered using HiveMind rather
> > than Spring?  It would be much more compatible license-wise with Nutch
> > and easier to integrate into the ASF repository.  Further, I wonder if
> > the existing plugin mechanism would work well as a HiveMind-based
> > system too.
> > 
> >     Erik
> > 
> > On Aug 23, 2005, at 12:02 AM, Kelvin Tan wrote:
> > 
> >> I've been working on some changes to crawling to facilitate its use
> >> as a non-whole-web crawler, and would like to gauge interest on this
> >> list about including it somewhere in the Nutch repo, hopefully before
> >> the map-red branch gets merged in.
> >>
> >> It is basically a partial re-write of the whole fetching mechanism,
> >> borrowing large chunks of code here and there.
> >>
> >> Features include:
> >> - Customizable seed inputs, i.e. seed a crawl from a file, database,
> >>   Nutch FetchList, etc.
> >> - Customizable crawl scopes, e.g. crawl the seed URLs and only the
> >>   URLs within their domains (this can already be accomplished
> >>   manually with RegexURLFilter, but what if there are 200,000 seed
> >>   URLs?), or crawl seed URL domains + 1 external link (not possible
> >>   with the current filter mechanism)
> >> - Online fetchlist building (as opposed to Nutch’s offline method),
> >>   and customizable strategies for building a fetchlist. The default
> >>   implementation gives priority to hosts with a larger number of
> >>   pages to crawl. Note that offline fetchlist building is OK too.
> >> - Runs continuously until all links are crawled
> >> - Customizable fetch output mechanisms, like output to file, to
> >>   WebDB, or even no output at all (if we’re just implementing a
> >>   link checker, for example)
> >> - Fully utilizes HTTP 1.1 connection persistence and request
> >>   pipelining
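To make the crawl-scope point concrete, here is a hypothetical sketch (the names are mine, not Kelvin's actual API) of a scope that admits only URLs on the seed hosts - a set-membership test rather than 200,000 generated regex rules:

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;

// Hypothetical crawl scope: a URL is in scope iff its host matches the
// host of some seed URL. Scales with a hash lookup per candidate URL
// instead of a regex pass over every pattern.
class SeedHostScope {
    private final Set<String> seedHosts = new HashSet<String>();

    void addSeed(String seedUrl) {
        try {
            seedHosts.add(new URL(seedUrl).getHost().toLowerCase());
        } catch (MalformedURLException e) {
            // skip malformed seed
        }
    }

    boolean inScope(String candidateUrl) {
        try {
            return seedHosts.contains(new URL(candidateUrl).getHost().toLowerCase());
        } catch (MalformedURLException e) {
            return false;
        }
    }
}
```

The "seed domains + 1 external link" variant would additionally tag each discovered URL with its hop distance from an in-scope page, which is exactly the kind of state a per-URL filter interface like RegexURLFilter cannot see.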
> >>
> >> It is fully compatible with Nutch as it is, i.e. given a Nutch
> >> fetchlist, the new crawler can produce a Nutch segment. However, if
> >> you don’t need that at all, and are just interested in Nutch as a
> >> crawler, then that’s OK too!
> >>
> >> It is a drop-in replacement for the Nutch crawler, and compiles with
> >> the recently released 0.7 jar.
> >>
> >> Some disclaimers:
> >> It was never designed to be a superset replacement for the Nutch
> >> crawler. Rather, it is tailored to the fairly specific requirements
> >> of what I believe is called constrained crawling. It uses the Spring
> >> Framework (for easy customization of implementation classes) and JDK
> >> 5 features (occasional new loop syntax, autoboxing, generics, etc.).
> >> These 2 points sped up development, but probably make it an untasty
> >> Nutch acquisition.. ;-) But it shouldn't be tough to do something
> >> about that..
> >>
> >> One of the areas where the Nutch crawler could use some improvement
> >> is that it is really difficult to extend and customize. With the
> >> addition of interfaces and beans, it is possible for developers to
> >> develop their own mechanism for fetchlist prioritization, or to use
> >> a B-Tree as the backing implementation of the database of crawled
> >> URLs. I'm using Spring to make it easy to change implementations and
> >> to make loose coupling easy.
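A hypothetical illustration of what "interfaces and beans" buys here (interface and class names are invented for this sketch, not taken from Kelvin's code): callers depend on a small interface, and the largest-host-first default described above is just one swappable implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Illustrative interface: a strategy for ordering hosts in the fetchlist.
// Spring (or a plain factory) picks the implementation, so replacing the
// policy means writing one class, not patching the crawler.
interface FetchListPrioritizer {
    // Returns host names ordered highest priority first.
    List<String> prioritize(Map<String, Integer> pendingPagesPerHost);
}

// The default strategy from the post: hosts with more pages to crawl
// come first.
class LargestHostFirst implements FetchListPrioritizer {
    public List<String> prioritize(final Map<String, Integer> pending) {
        List<String> hosts = new ArrayList<String>(pending.keySet());
        Collections.sort(hosts, new Comparator<String>() {
            public int compare(String a, String b) {
                return pending.get(b).compareTo(pending.get(a));
            }
        });
        return hosts;
    }
}
```

A breadth-first or politeness-weighted strategy would then be another implementation of the same interface, selected in the bean wiring rather than in the crawler's code.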
> >>
> >> There are some places where existing Nutch functionality is
> >> duplicated in some way to allow for slight modifications, as opposed
> >> to patching the Nutch classes. The rationale behind this approach was
> >> to simplify integration - it is much easier to have Our Crawler as a
> >> separate jar which depends on the Nutch jar. Furthermore, if it
> >> doesn't get accepted into Nutch, no rewriting or patching of Nutch
> >> sources needs to be done.
> >>
> >> It is my belief that if you're using Nutch for anything but
> >> whole-web crawling and need to make even small changes to the way
> >> the crawling is performed, you'll find Our Crawler helpful.
> >>
> >> I consider the current code to be beta quality. I've run it on
> >> smallish crawls (200k+ URLs) and things seem to be working OK, but
> >> it is nowhere near production quality.
> >>
> >> Some related blog entries:
> >>
> >> Improving Nutch for constrained crawls
> >> http://www.supermind.org/index.php?p=274
> >>
> >> Reflections on modifying the Nutch crawler
> >> http://www.supermind.org/index.php?p=283
> >>
> >> Limitations of OC
> >> http://www.supermind.org/index.php?p=284
> >>
> >> Even if we decide not to include it in the Nutch repo, the code will
> >> still be released under the APL. I'm in the process of adding a bit
> >> more documentation and a shell script for running it, and will
> >> release the files over the next couple of days.
> >>
> >> Cheers,
> >> Kelvin
> >>
> >> http://www.supermind.org
> >>



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
