Hi Jeremy:

1) I guess the solution/patch provided by Kelvin tries to enhance site-fetching performance in several ways. One of these is using HTTP 1.1 features. His crawler works site by site---fetching a sequence of URLs with the same host. See his concept at http://www.supermind.org/index.php?cat=17

2) I think your approach builds on the existing Nutch scenario with minimal data-structure modification in the webDB. I am running tests on Kelvin's patch now. I wonder if you could provide more detail about your patch so that I can run tests on it as well.

thanks,

Michael Ji

--- Jeremy Calvert <[EMAIL PROTECTED]> wrote:

> Like Kelvin, I too have been trying to get limited crawl capabilities
> out of Nutch.
>
> I've come up with a simplistic approach. I'm afraid I haven't had time
> to try out Kelvin's approach.
>
> I extend Page to store a depth and a radius byte. Loosely speaking,
> depth is the distance you can hop within a given site (based on
> domainID), and radius is the distance you can hop once you've left the
> site.
>
> You set these when you inject seed URLs.
>
> When you create new pages from outgoing links, you call
> linkedPage.propagateDepthAndRadius(pageWithOutgoingLink), where:
>
>   /**
>    * @param incoming The pointing page.
>    */
>   public void propagateDepthAndRadius(Page incoming) {
>     boolean sameSite = false;
>     try {
>       sameSite = this.computeDomainID() == incoming.computeDomainID();
>     } catch (MalformedURLException e) {
>       // oh well, I guess they're different domains
>     }
>     if (sameSite && incoming.depth > 0) {
>       // same site: decrement depth, maintain radius
>       this.depth = (byte) (incoming.depth - 1);
>       this.radius = incoming.radius;
>     } else {
>       // different site or out of depth: decrement radius
>       this.depth = 0;
>       this.radius = (byte) (incoming.radius - 1);
>     }
>   }
>
> If the page already exists when you go to add it to the DB (with
> instruction ADD_PAGE_IFN_PRESENT), you take the max of the existing
> depth and radius and the newly assigned depth and radius.
>
> The overall code modifications are about 30 lines---small additions to
> WebDBWriter and Page.
>
> From there, it's fun and handy to have depth and radius at your
> disposal when creating the fetchlist. I've written a new FetchListTool
> that uses them to keep pages that have reached the end of their
> constraints out of the fetchlist and to prioritize the pages to fetch.
> I also perturb the priorities slightly, by 0.001%, so that, if I do
> have enough domains to prevent my fetches from piling up on a single
> host, I generally do.
>
> Impacts:
> WebDBWriter (12 lines)
> Page (~20 lines)
> Requires a new or modified FetchListTool.
>
> It's a simple and elegant solution for constrained crawling, but it
> does touch the WebDB. I'm interested to hear people's thoughts, and
> would be more than happy to contribute a patch.
>
> J
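For what it's worth, here is how I read the ADD_PAGE_IFN_PRESENT merge
step you describe, as a minimal standalone sketch. PageBudget and its
fields are illustrative stand-ins for your extended Page class, not the
actual patch:

  // Sketch of the merge rule: when a page is already present in the
  // webDB, keep the larger of the stored and the newly propagated
  // depth/radius, so the page retains its widest crawl budget.
  public class PageBudget {
      byte depth;   // hops still allowed within the page's own site
      byte radius;  // hops still allowed after leaving the site

      PageBudget(byte depth, byte radius) {
          this.depth = depth;
          this.radius = radius;
      }

      // ADD_PAGE_IFN_PRESENT: merge an incoming budget into the stored one.
      void merge(PageBudget incoming) {
          this.depth  = (byte) Math.max(this.depth,  incoming.depth);
          this.radius = (byte) Math.max(this.radius, incoming.radius);
      }

      public static void main(String[] args) {
          PageBudget stored   = new PageBudget((byte) 1, (byte) 0);
          PageBudget incoming = new PageBudget((byte) 3, (byte) 2);
          stored.merge(incoming);
          // prints depth=3 radius=2
          System.out.println("depth=" + stored.depth + " radius=" + stored.radius);
      }
  }

Is that the intended behavior when the same URL is reached from both a
deep seed and a shallow off-site link?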
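And here is a rough guess at the kind of depth/radius-aware fetchlist
scoring you mention: drop pages whose budget is exhausted, rank the
rest by remaining budget, and jitter the score by roughly 0.001% so
fetches spread across hosts when enough domains are available. The
class, method, and formula are my own illustration---I haven't seen
your FetchListTool:

  import java.util.Random;

  // Illustrative fetchlist scoring: pages with no remaining depth or
  // radius are excluded; the rest are ranked by remaining budget with
  // a tiny random perturbation to break ties across hosts.
  public class FetchPriority {
      private static final Random RAND = new Random();

      // Returns a sort key for the fetchlist, or -1 if the page is out of budget.
      static double score(byte depth, byte radius) {
          if (depth <= 0 && radius <= 0) {
              return -1.0;  // end of constraints: leave it out of the fetchlist
          }
          double base = depth + radius;                        // more budget fetches first
          double jitter = base * 0.00001 * RAND.nextDouble();  // ~0.001% perturbation
          return base + jitter;
      }

      public static void main(String[] args) {
          System.out.println(score((byte) 2, (byte) 1));  // positive score: fetch
          System.out.println(score((byte) 0, (byte) 0));  // -1.0: skip
      }
  }

Does that roughly match how your new FetchListTool ranks pages?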
