On Wed, 2005-07-06 at 09:36 -0400, Andy Liu wrote:
> You can emit multiple fetchlists using the -numFetchers option, copy
> each segment to a different machine to fetch, copy the segments back,
> and run updatedb on all the segments.
Is it safe to create a new fetchlist before all fetchlists from the
previous generation have had updatedb run on them? That is, can we
overlap cycles a bit?

1) fetchlist -numFetchers 2
2) fetch segment1
3) updatedb segment1
4) fork: one process continues below while another returns to step 1)
5) fetch segment2
6) updatedb segment2
7) process exits

While fetchlist and updatedb are running for one segment, the other
segments would be fetched and updated shortly after.

Experimentally, I believe I've determined that it is safe, provided you
can get to step 7) within 7 days.

> On 7/6/05, Karen Church <[EMAIL PROTECTED]> wrote:
> > Hi All,
> >
> > I was wondering if someone could point me in the right direction for
> > carrying out a distributed crawl. Basically I want to split a crawl
> > over a few machines. Is there a way of just 'fetching' the pages
> > using multiple machines and then merging the results onto a single
> > machine? Can I then run the Nutch indexing process over that single
> > machine?
> >
> > Thanks
> > Karen

-- 
Rod Taylor <[EMAIL PROTECTED]>
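To make the question concrete, here is a minimal sketch of the overlapped
cycle in steps 1)-7). The command names and argument order assume a
Nutch 0.7-era CLI (bin/nutch generate/fetch/updatedb), and the db and
segment paths are placeholders I've made up for illustration; this is a
sketch of the control flow, not a verified script for any release:

```shell
#!/bin/sh
# Sketch of the overlapped crawl cycle. All paths and argument orders
# are assumptions; check them against your Nutch version's usage output.
DB=db
SEGMENTS=segments

crawl_cycle() {
  # 1) Generate fetchlists for two new segments.
  bin/nutch generate $DB $SEGMENTS -numFetchers 2
  seg1=$(ls -d $SEGMENTS/* | sort | tail -2 | head -1)
  seg2=$(ls -d $SEGMENTS/* | sort | tail -1)

  # 2)-3) Fetch the first segment and fold its results into the db.
  bin/nutch fetch $seg1
  bin/nutch updatedb $DB $seg1

  # 4) Fork: one process starts the next cycle while this one finishes.
  crawl_cycle &

  # 5)-7) Fetch and update the second segment, then exit. Per the
  # observation above, this must complete within the 7-day window.
  bin/nutch fetch $seg2
  bin/nutch updatedb $DB $seg2
}

crawl_cycle
```

Note the fork at step 4) means cycles overlap without bound unless you
add some throttling; in practice you'd want the next generate to wait
until at least one updatedb from the previous cycle has finished.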
