Sure, you can overlap as you described in your post. For each page emitted during fetchlist generation, the page's nextFetch time is pushed forward 7 days in the webdb. This is done specifically so that the same page won't be emitted by subsequent runs of generate within those 7 days.
Keep in mind that the webdb is rebuilt from scratch every time you perform a generate or updatedb operation. So, for example, running updatedb on 2 segments separately takes much longer than running updatedb on both segments at the same time.

On 7/6/05, Rod Taylor <[EMAIL PROTECTED]> wrote:
> On Wed, 2005-07-06 at 09:36 -0400, Andy Liu wrote:
> > You can emit multiple fetchlists using the -numFetchers option, copy
> > each segment to a different machine to fetch, copy the segments back,
> > and run updatedb on all the segments.
>
> Is it safe to create a new fetchlist before all fetchlists from the
> previous generation have had updatedb run on them?
>
> That is, can we overlap cycles a bit?
>
> 1) fetchlist -numFetchers 2
> 2) fetch segment1
> 3) updatedb segment1
> 4) fork process with one part continuing and another part returning to 1)
> 5) fetch segment2
> 6) updatedb segment2
> 7) process exits
>
> While fetchlist and updatedb are running, the other segments would be
> fetched and updated shortly after.
>
> Experimentally I believe I've determined that it is, provided you can get
> to 7) within 7 days.
>
> > On 7/6/05, Karen Church <[EMAIL PROTECTED]> wrote:
> > > Hi All,
> > >
> > > I was wondering if someone could point me in the right direction for
> > > carrying out a distributed crawl. Basically I want to split a crawl over
> > > a few machines. Is there a way of just 'fetching' the pages using
> > > multiple machines and then merging the results onto a single machine? Can
> > > I then run the Nutch indexing process over that single machine?
> > >
> > > Thanks
> > > Karen
>
> --
> Rod Taylor <[EMAIL PROTECTED]>
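For what it's worth, the overlapped cycle in the quoted steps above can be sketched as a small shell script. This is a dry-run sketch only: the run() wrapper just prints each command instead of executing it, the segment names (segment1, segment2) and the db/segments paths are placeholder assumptions, and the exact generate/fetch/updatedb argument forms may differ in your Nutch version.

```shell
#!/bin/sh
# Dry-run wrapper: prints the command it would execute.
# Remove the "run" prefix below to actually invoke Nutch.
run() { printf '+ %s\n' "$*"; }

DB=db               # assumed webdb path
SEGMENTS=segments   # assumed segments directory

# 1) emit two fetchlists in a single generate pass
run bin/nutch generate "$DB" "$SEGMENTS" -numFetchers 2

# 2-3) fetch and update the first segment; once its updatedb is done,
#      a new generate cycle could start in parallel
run bin/nutch fetch "$SEGMENTS/segment1"
run bin/nutch updatedb "$DB" "$SEGMENTS/segment1"

# 5-6) fetch and update the second segment shortly after; per the thread,
#      this must complete within the 7-day nextFetch window
run bin/nutch fetch "$SEGMENTS/segment2"
run bin/nutch updatedb "$DB" "$SEGMENTS/segment2"
```

Since updatedb rebuilds the webdb each time, a further refinement (as noted above) would be passing both segments to a single updatedb invocation where your version supports it.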
