From what you described in your last post, it seems that you're
generating all 10 segments right off the bat.  If you look more
closely at the CrawlTool code, each iteration not only generates the
fetchlist, it also fetches the segment and updates the webdb with the
results of the fetch.
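To make that concrete, here's a dry-run sketch of what one CrawlTool-style loop boils down to.  It only prints the commands rather than executing them; the db/segments paths, the depth, and the segment directory names are made up for illustration, and I'm assuming the usual bin/nutch wrapper names for FetchListTool, Fetcher, and the db update tool:

```shell
#!/bin/sh
# Dry-run sketch of a CrawlTool-style crawl loop: print each command
# instead of running it.  All paths/names here are hypothetical.
run() { echo "$@"; }

DB=db
SEG=segments
DEPTH=5

i=1
while [ "$i" -le "$DEPTH" ]; do
  # 1. generate a fetchlist -> creates a new segment (FetchListTool)
  run bin/nutch generate "$DB" "$SEG"
  # 2. fetch that segment (Fetcher)
  run bin/nutch fetch "$SEG/segment-$i"
  # 3. fold the fetch results back into the webdb
  run bin/nutch updatedb "$DB" "$SEG/segment-$i"
  i=$((i + 1))
done > crawl-plan.txt

cat crawl-plan.txt
```

The point is that generate, fetch, and updatedb happen inside every iteration, so each round's fetchlist is built from the results of the previous round.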

It sounds like CrawlTool may be sufficient for your use case.  What's
your rationale for fetching with multiple machines?  Using multiple
machines may not always speed up the crawling process, especially
since you're adding extra time and complexity by generating multiple
segments and copying them back and forth between machines.
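If you do go that route, the multi-machine workflow I described earlier looks roughly like this.  Again a dry-run sketch that just prints the commands; the host names, paths, and segment directory names are illustrative, and I'm assuming bin/nutch wrappers for FetchListTool and Fetcher plus plain scp/ssh for moving segments around:

```shell
#!/bin/sh
# Dry-run sketch of a distributed fetch: emit one fetchlist per
# machine, fetch each segment remotely, then merge results back.
run() { echo "$@"; }

DB=db
SEG=segments
HOSTS="crawler1 crawler2"   # hypothetical fetch machines

{
  # 1. emit one fetchlist per fetch machine
  run bin/nutch generate "$DB" "$SEG" -numFetchers 2

  n=1
  for h in $HOSTS; do
    # 2. ship a segment to its fetcher and run the Fetcher there
    run scp -r "$SEG/segment-$n" "$h:$SEG/"
    run ssh "$h" bin/nutch fetch "$SEG/segment-$n"
    # 3. copy the fetched segment back to the main machine
    run scp -r "$h:$SEG/segment-$n" "$SEG/"
    # 4. fold the results into the shared webdb
    run bin/nutch updatedb "$DB" "$SEG/segment-$n"
    n=$((n + 1))
  done
} > distributed-plan.txt

cat distributed-plan.txt
```

Note that the webdb stays on one machine; only the segments travel, which is where the copying overhead I mentioned comes from.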

Another option is to try using NDFS, although I haven't used it in
months and I don't know how supported it is anymore.  Or you can wait
until Doug releases his MapReduce code.

Andy

On 7/6/05, Karen Church <[EMAIL PROTECTED]> wrote:
> Thanks for your reply, Andy.  Based on what you've suggested I've been
> doing some tests.  I've:
> 
> 1. created a db
> 2. injected a set of urls
> 3. used the FetchListTool with the -numFetchers option to specify how many
> fetchlists I want to emit.
> 
> However, I'm a little confused about something.  Let's say I want to crawl to
> a depth of 5.  In looking at the CrawlTool code, the FetchListTool is just
> called X number of times where X is the desired crawl depth.  I've taken a
> similar approach, i.e. I've created a small application that calls the
> FetchListTool X number of times using a for loop, where the bound on the
> loop is the crawl depth I want which in this case is 5.
> 
> For test purposes, I'm setting the numfetchers to 2 and the crawl depth to
> 5.  The result is 10 segment directories, each with its own fetchlist.
> Example:
> 
> segment1-0
> segment2-0
> segment3-0
> segment4-0
> segment5-0
> segment1-1
> segment2-1
> segment3-1
> segment4-1
> segment5-1
> 
> Am I on the right track with this approach?
> 
> So for me to carry out the fetch process on multiple machines do I just copy
> these segments onto the various machines and fetch the segments using the
> 'Fetcher' class?
> 
> Apologies if my questions seem strange - I'm just a little confused about
> whether or not I'm on the right track and how to proceed from here.
> 
> Thanks,
> Karen
> 
> ----- Original Message -----
> From: "Andy Liu" <[EMAIL PROTECTED]>
> To: <[email protected]>
> Sent: Wednesday, July 06, 2005 2:36 PM
> Subject: Re: Distributed Crawl
> 
> 
> You can emit multiple fetchlists using the -numFetchers option, copy
> each segment to a different machine to fetch, copy the segments back,
> and run updatedb on all the segments.
> 
> Andy
> 
> On 7/6/05, Karen Church <[EMAIL PROTECTED]> wrote:
> > Hi All,
> >
> > I was wondering if someone could point me in the right direction for
> > carrying out a distributed crawl.  Basically I want to split a crawl over a
> > few machines. Is there a way of just 'fetching' the pages using multiple
> > machines and then merging the results onto a single machine? Can I then
> > run the Nutch indexing process on that single machine?
> >
> > Thanks
> > Karen
> >
> >
> 
>

