Hi Jeremy:

1) I guess the solution/patch provided by Kelvin tries to enhance site-fetching performance in several ways. One of these is using HTTP 1.1 features. His crawler works site by site---fetching a sequence of URLs with the same host. See his concept at http://www.supermind.org/index.php?cat=17

2) I think your approach builds on the existing Nutch scenario with minimal data-structure modification in the webDB. I am running tests on Kelvin's patch now. I wonder if you could provide more detail about your patch so that I can run tests on it as well.

thanks,

Michael Ji

--- Jeremy Calvert <[EMAIL PROTECTED]> wrote:

> Like Kelvin, I too have been trying to get limited crawl capabilities
> out of Nutch.
>
> I've come up with a simplistic approach. I'm afraid I haven't had time
> to try out Kelvin's approach.
>
> I extend Page to store a depth and a radius byte. Loosely speaking,
> depth is the distance you can hop within a given site (based on
> domainID), and radius is the distance you can hop once you've left the
> site.
>
> You set these when you inject seed URLs.
>
> When you create new pages from outgoing links, you call
> linkedPage.propagateDepthAndRadius(pageWithOutgoingLink), where:
>
>   /**
>    * @param incoming The pointing page.
>    */
>   public void propagateDepthAndRadius(Page incoming) {
>     boolean sameSite = false;
>     try {
>       sameSite = this.computeDomainID() == incoming.computeDomainID();
>     } catch (MalformedURLException e) {
>       // oh well, I guess they're different domains
>     }
>     if (sameSite && incoming.depth > 0) {
>       // same site: decrement depth, maintain radius
>       this.depth = (byte) (incoming.depth - 1);
>       this.radius = incoming.radius;
>     } else {
>       // different site or out of depth: decrement radius
>       this.depth = 0;
>       this.radius = (byte) (incoming.radius - 1);
>     }
>   }
>
> If the page already exists when you go to add it to the DB (with
> instruction ADD_PAGE_IFN_PRESENT), you take the max of the existing
> depth and radius and the newly assigned depth and radius.
>
> The overall code modifications are about 30 lines---small additions to
> WebDBWriter and Page.
>
> From there, it's fun and handy to have depth and radius at your
> disposal when creating the fetchlist. I've written a new FetchListTool
> that uses them to keep pages that have reached the end of their
> constraints out of the fetchlist and to prioritize the pages to fetch.
> I also perturb the priorities slightly, by 0.001%, so that, if I do
> have enough domains to prevent my fetches from piling up on a single
> host, I generally do.
>
> Impacts:
> WebDBWriter (12 lines)
> Page (~20 lines)
> Requires a new or modified FetchListTool.
>
> It's a simple and elegant solution for constrained crawling, but it
> does touch the WebDB. I'm interested to hear people's thoughts, and
> would be more than happy to contribute a patch.
>
> J
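For what it's worth, here is how I read the ADD_PAGE_IFN_PRESENT merge
step you describe, as a minimal standalone sketch. PageBudget and its
fields are illustrative stand-ins for your extended Page class, not the
actual patch:

  // Sketch of the merge rule: when a page is already present in the
  // webDB, keep the larger of the stored and the newly propagated
  // depth/radius, so the page retains its widest crawl budget.
  public class PageBudget {
      byte depth;   // hops still allowed within the page's own site
      byte radius;  // hops still allowed after leaving the site

      PageBudget(byte depth, byte radius) {
          this.depth = depth;
          this.radius = radius;
      }

      // ADD_PAGE_IFN_PRESENT: merge an incoming budget into the stored one.
      void merge(PageBudget incoming) {
          this.depth  = (byte) Math.max(this.depth,  incoming.depth);
          this.radius = (byte) Math.max(this.radius, incoming.radius);
      }

      public static void main(String[] args) {
          PageBudget stored   = new PageBudget((byte) 1, (byte) 0);
          PageBudget incoming = new PageBudget((byte) 3, (byte) 2);
          stored.merge(incoming);
          // prints depth=3 radius=2
          System.out.println("depth=" + stored.depth + " radius=" + stored.radius);
      }
  }

Is that the intended behavior when the same URL is reached from both a
deep seed and a shallow off-site link?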
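And here is a rough guess at the kind of depth/radius-aware fetchlist
scoring you mention: drop pages whose budget is exhausted, rank the
rest by remaining budget, and jitter the score by roughly 0.001% so
fetches spread across hosts when enough domains are available. The
class, method, and formula are my own illustration---I haven't seen
your FetchListTool:

  import java.util.Random;

  // Illustrative fetchlist scoring: pages with no remaining depth or
  // radius are excluded; the rest are ranked by remaining budget with
  // a tiny random perturbation to break ties across hosts.
  public class FetchPriority {
      private static final Random RAND = new Random();

      // Returns a sort key for the fetchlist, or -1 if the page is out of budget.
      static double score(byte depth, byte radius) {
          if (depth <= 0 && radius <= 0) {
              return -1.0;  // end of constraints: leave it out of the fetchlist
          }
          double base = depth + radius;                        // more budget fetches first
          double jitter = base * 0.00001 * RAND.nextDouble();  // ~0.001% perturbation
          return base + jitter;
      }

      public static void main(String[] args) {
          System.out.println(score((byte) 2, (byte) 1));  // positive score: fetch
          System.out.println(score((byte) 0, (byte) 0));  // -1.0: skip
      }
  }

Does that roughly match how your new FetchListTool ranks pages?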
