Like Kelvin, I too have been trying to get limited crawl capabilities
out of Nutch.
I've come up with a simplistic approach. I'm afraid I haven't had time
to try out Kelvin's approach.
I extend Page to store a depth byte and a radius byte. Loosely speaking,
depth is the distance you can hop within a given site (based on
domainID), and radius is the distance you can hop once you've left the site.
You set these when you inject seed URLs.
When you create new pages from outgoing links, you call
linkedPage.propagateDepthAndRadius(pageWithOutgoingLink) where:
/**
 * @param incoming The pointing page.
 */
public void propagateDepthAndRadius(Page incoming) {
    boolean sameSite = false;
    try {
        sameSite = this.computeDomainID() == incoming.computeDomainID();
    } catch (MalformedURLException e) {
        // Oh well, I guess they're different domains.
    }
    if (sameSite && incoming.depth > 0) {
        // Same site: decrement depth, maintain radius.
        this.depth = (byte) (incoming.depth - 1);
        this.radius = incoming.radius;
    } else {
        // Different sites or out of depth: decrement radius.
        this.depth = 0;
        this.radius = (byte) (incoming.radius - 1);
    }
}
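To see how the budget runs down, here is a minimal standalone sketch of the same rule. It is not the actual patch: CrawlBudget is a hypothetical stand-in for Page, a site string stands in for the result of computeDomainID(), and the exhausted() cutoff (radius below zero) is my assumption about where a fetchlist would stop.

```java
// Hypothetical stand-in for Page: only the fields the rule needs.
final class CrawlBudget {
    final String site;   // stands in for the value of computeDomainID()
    final byte depth;    // hops left within the current site
    final byte radius;   // hops left once the site has been left

    CrawlBudget(String site, int depth, int radius) {
        this.site = site;
        this.depth = (byte) depth;
        this.radius = (byte) radius;
    }

    // Budget for a page on targetSite reached through a link from this page.
    CrawlBudget follow(String targetSite) {
        if (site.equals(targetSite) && depth > 0) {
            // Same site with depth to spare: spend depth, keep radius.
            return new CrawlBudget(targetSite, depth - 1, radius);
        }
        // Different site, or out of depth: reset depth, spend radius.
        return new CrawlBudget(targetSite, 0, radius - 1);
    }

    // Assumed cutoff: a negative radius means the page is out of bounds.
    boolean exhausted() {
        return radius < 0;
    }
}
```

With a seed injected at depth 2 and radius 1, two same-site hops use up the depth, a third costs one unit of radius, and a fourth lands out of bounds. A link that leaves the seed's site immediately costs radius and arrives with depth 0.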
If the page already exists when you go to add it to the DB (with
instruction ADD_PAGE_IFN_PRESENT), you keep the larger of the existing
and newly assigned values, for depth and for radius.
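The merge for an already-present page can be sketched as below. BudgetMerge and its merge method are hypothetical names for illustration, not the WebDBWriter code itself; the only thing taken from the description above is the field-by-field max.

```java
// Hypothetical helper: combines the stored budget with a newly propagated one.
final class BudgetMerge {
    // Returns {depth, radius}, keeping the larger value of each field.
    static byte[] merge(byte existingDepth, byte existingRadius,
                        byte newDepth, byte newRadius) {
        return new byte[] {
            (byte) Math.max(existingDepth, newDepth),
            (byte) Math.max(existingRadius, newRadius)
        };
    }
}
```

Because each field is maximized on its own, the stored pair can end up more generous than either path alone: merging (depth 0, radius 2) with (depth 3, radius 1) gives (depth 3, radius 2).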
The overall code modifications are about 30 lines...small additions to
WebDBWriter and Page.
From there, it's fun and handy to have depth and radius at your
disposal when creating the fetchlist. I've written a new FetchListTool
that uses them to leave out pages that have run past their constraints
and to prioritize the pages to fetch. I also perturb the priorities
slightly (by 0.001%) so that, provided I have enough domains to choose
from, my fetches generally don't pile up on a single host.
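A rough sketch of those two fetchlist steps follows. It is not my FetchListTool: PriorityJitter, perturb and withinConstraints are hypothetical names, the uniform jitter is just one way to apply a 0.001% perturbation, and the radius cutoff is an assumption.

```java
import java.util.Random;

// Hypothetical helper for fetchlist generation.
final class PriorityJitter {
    // 0.001% expressed as a fraction.
    static final double RELATIVE_JITTER = 1e-5;

    // Nudges a score by at most +/-0.001% so that equal scores no longer tie.
    static double perturb(double score, Random rng) {
        double factor = 1.0 + (2.0 * rng.nextDouble() - 1.0) * RELATIVE_JITTER;
        return score * factor;
    }

    // Assumed cutoff: pages whose radius has gone negative stay off the list.
    static boolean withinConstraints(byte radius) {
        return radius >= 0;
    }
}
```

The idea is that pages with identical scores no longer sort as one block, so entries from different hosts get a chance to interleave while the overall ordering is effectively unchanged.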
Impacts:
WebDBWriter (12 lines)
Page (~20 lines)
Requires a new or modified FetchListTool.
It's a simple and elegant solution for constrained crawl, but it does
touch the WebDB. I'm interested to hear people's thoughts, and would be
more than happy to contribute a patch.
J