Like Kelvin, I too have been trying to get limited crawl capabilities out of Nutch.

I've come up with a simplistic approach. I'm afraid I haven't had time to try out Kelvin's approach.

I extend Page to store a depth byte and a radius byte. Loosely speaking, depth is the distance you can hop within a given site (based on domainID), and radius is the distance you can hop once you've left the site.

You set these when you inject seed URLs.

When you create new pages from outgoing links, you call linkedPage.propagateDepthAndRadius(pageWithOutgoingLink) where:
   /**
    * @param incoming The pointing page.
    */
   public void propagateDepthAndRadius(Page incoming){
       boolean sameSite = false;
       try{
           sameSite = this.computeDomainID() == incoming.computeDomainID();
       }catch(MalformedURLException e){
           // oh well, I guess they're different domains.
       }
       if(sameSite && incoming.depth > 0){
           // same site, decrement depth, maintain radius
           this.depth = (byte) (incoming.depth - 1);
           this.radius = incoming.radius;
       }else{
           // different sites or out of depth, decrement radius
           this.depth = 0;
           this.radius = (byte) (incoming.radius - 1);
       }
   }
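
To make the rule concrete, here is a small standalone trace of the same logic. It is only a sketch: DepthRadiusTrace and its propagate helper are hypothetical names, it uses ints instead of bytes, and the sameSite flag stands in for the domainID comparison.

```java
/** Hypothetical standalone version of the propagation rule, for tracing by hand. */
final class DepthRadiusTrace {
    /** Returns {depth, radius} for a page linked from a page with the given values. */
    static int[] propagate(int depth, int radius, boolean sameSite) {
        if (sameSite && depth > 0) {
            // same site with depth left: spend depth, keep radius
            return new int[] {depth - 1, radius};
        }
        // different site, or same site but out of depth: spend radius
        return new int[] {0, radius - 1};
    }

    public static void main(String[] args) {
        int[] page = {2, 1};                          // seed: depth 2, radius 1
        boolean[] hops = {true, true, true, false};   // three same-site hops, then one off-site
        for (boolean sameSite : hops) {
            page = propagate(page[0], page[1], sameSite);
            System.out.println("depth=" + page[0] + " radius=" + page[1]);
        }
    }
}
```

Starting from depth 2 and radius 1, the three same-site hops give (1,1), (0,1) and (0,0), and the off-site hop then gives (0,-1). Note that running out of depth on the same site also costs radius.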

If the page already exists when you go to add it to the DB (with instruction ADD_PAGE_IFN_PRESENT), you keep the max of the existing and the newly assigned depth, and likewise for radius.
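
Roughly, the merge looks like the sketch below. CrawlBounds is a hypothetical stand-in for the two fields on Page, not the actual WebDBWriter change, and it assumes the max is taken separately for each field.

```java
/** Hypothetical holder for the depth and radius bytes stored on a page. */
final class CrawlBounds {
    final byte depth;
    final byte radius;

    CrawlBounds(int depth, int radius) {
        this.depth = (byte) depth;
        this.radius = (byte) radius;
    }

    /** Field-wise max: keep the most permissive values seen for this URL so far. */
    static CrawlBounds merge(CrawlBounds existing, CrawlBounds incoming) {
        return new CrawlBounds(
            Math.max(existing.depth, incoming.depth),
            Math.max(existing.radius, incoming.radius));
    }
}
```

So a page already stored with depth 0 and radius 1 that is rediscovered with depth 2 and radius 0 ends up with depth 2 and radius 1.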

The overall code modifications are about 30 lines: small additions to WebDBWriter and Page.

From there, it's fun and handy to have depth and radius at your disposal when creating the fetchlist. I've written a new FetchListTool that uses them to keep out pages that have reached the end of their constraints and to prioritize the pages to fetch. I also perturb the priorities slightly, by 0.001%, so that fetches generally don't pile up on a single host when I have enough domains to spread them across.
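
The two fetchlist ideas can be sketched as follows. This is not the FetchListTool itself: FetchPriority and both method names are hypothetical, the "radius below zero means out of bounds" cutoff is my assumption based on how propagation decrements radius, and the jitter is read as a relative change of at most 0.001% in either direction.

```java
import java.util.Random;

/** Hypothetical helpers for filtering and prioritizing fetchlist candidates. */
final class FetchPriority {
    /** Assumption: a page whose radius has dropped below zero is past the crawl limits. */
    static boolean withinConstraints(byte depth, byte radius) {
        return radius >= 0;
    }

    /** Scales base by a random factor in [1 - 1e-5, 1 + 1e-5), i.e. at most 0.001%. */
    static double perturb(double base, Random rng) {
        double jitter = (rng.nextDouble() * 2.0 - 1.0) * 1e-5;
        return base * (1.0 + jitter);
    }
}
```

The jitter is small enough that it only breaks ties between pages of effectively equal priority, which is what lets pages from different hosts interleave in the sorted list.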

Impacts:
WebDBWriter (12 lines)
Page (~20 lines)
Requires new or modified FetchList tool.

It's a simple and elegant solution for constrained crawl, but it does touch the WebDB. I'm interested to hear people's thoughts, and would be more than happy to contribute a patch.

J
