Sure, you can overlap as you described in your post. For each page emitted during fetchlist generation, the page's nextFetch time is pushed forward 7 days in the webdb. This is done specifically so that the same page won't be emitted by subsequent runs of generate within those 7 days.
Keep in mind that the webdb is rebuilt from scratch every time you perform a generate or updatedb operation. So, for example, running updatedb on 2 segments separately takes much longer than running updatedb on both segments at the same time.

On 7/6/05, Rod Taylor <[EMAIL PROTECTED]> wrote:
> On Wed, 2005-07-06 at 09:36 -0400, Andy Liu wrote:
> > You can emit multiple fetchlists using the -numFetchers option, copy
> > each segment to a different machine to fetch, copy the segments back,
> > and run updatedb on all the segments.
>
> Is it safe to create a new fetchlist before all fetchlists from the
> previous generation have had updatedb run on them?
>
> That is, can we overlap cycles a bit?
>
> 1) fetchlist -numFetchers 2
> 2) fetch segment1
> 3) updatedb segment1
> 4) fork process with one part continuing and another part returning to 1)
> 5) fetch segment2
> 6) updatedb segment2
> 7) process exits
>
> While fetchlist and updatedb are running, the other segments would be
> fetched and updated shortly after.
>
> Experimentally I believe I've determined that it is, provided you can get
> to 7) within 7 days.
>
> > On 7/6/05, Karen Church <[EMAIL PROTECTED]> wrote:
> > > Hi All,
> > >
> > > I was wondering if someone could point me in the right direction for
> > > carrying out a distributed crawl. Basically I want to split a crawl over
> > > a few machines. Is there a way of just 'fetching' the pages using
> > > multiple machines and then merging the results onto a single machine? Can
> > > I then run the Nutch indexing process over that single machine?
> > >
> > > Thanks
> > > Karen
>
> --
> Rod Taylor <[EMAIL PROTECTED]>
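For what it's worth, the overlapped cycle in the quoted steps above can be sketched as a small shell script. This is a dry-run sketch only: the run() wrapper just prints each command instead of executing it, the segment names (segment1, segment2) and the db/segments paths are placeholder assumptions, and the exact generate/fetch/updatedb argument forms may differ in your Nutch version.

```shell
#!/bin/sh
# Dry-run wrapper: prints the command it would execute.
# Remove the "run" prefix below to actually invoke Nutch.
run() { printf '+ %s\n' "$*"; }

DB=db               # assumed webdb path
SEGMENTS=segments   # assumed segments directory

# 1) emit two fetchlists in a single generate pass
run bin/nutch generate "$DB" "$SEGMENTS" -numFetchers 2

# 2-3) fetch and update the first segment; once its updatedb is done,
#      a new generate cycle could start in parallel
run bin/nutch fetch "$SEGMENTS/segment1"
run bin/nutch updatedb "$DB" "$SEGMENTS/segment1"

# 5-6) fetch and update the second segment shortly after; per the thread,
#      this must complete within the 7-day nextFetch window
run bin/nutch fetch "$SEGMENTS/segment2"
run bin/nutch updatedb "$DB" "$SEGMENTS/segment2"
```

Since updatedb rebuilds the webdb each time, a further refinement (as noted above) would be passing both segments to a single updatedb invocation where your version supports it.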
