On Wed, 2005-07-06 at 13:02 -0400, Andy Liu wrote:
> Sure, you can overlap like you described in your post.  For each page
> emitted during fetchlist generation, the page's nextFetch time is set
> forward 7 days within the webdb.  This is specifically done so that
> the same page won't be emitted during subsequent runs of generate
> within 7 days.
>
> Keep in mind that the webdb is rebuilt from scratch every time you
> perform generate or updatedb operations.  So, for example, running
> updatedb on 2 segments separately would take much longer than running
> updatedb on both segments at the same time.

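To make the nextFetch behaviour described above concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions: the dict-based `webdb`, the `generate` function and its `limit` parameter are hypothetical stand-ins, not Nutch's actual data structures or API, and updatedb is not modelled at all.

```python
from datetime import datetime, timedelta

REFETCH_DELAY = timedelta(days=7)  # the 7-day push described in the reply above


def generate(webdb, now, limit=None):
    """Toy fetchlist generation (hypothetical, not Nutch code).

    webdb is assumed to be a dict mapping URL -> nextFetch datetime.
    Pages that are due get emitted, and their nextFetch is moved
    forward by 7 days so that a later generate run skips them.
    """
    due = sorted(url for url, next_fetch in webdb.items() if next_fetch <= now)
    if limit is not None:
        due = due[:limit]
    for url in due:
        webdb[url] = now + REFETCH_DELAY
    return due


now = datetime(2005, 7, 6, 13, 0)
webdb = {
    "http://example.com/a": now,
    "http://example.com/b": now,
    "http://example.com/c": now,
}

# First cycle emits two pages.
first = generate(webdb, now, limit=2)

# A second generate an hour later, before any updatedb has run,
# emits only the page that was left over from the first run.
second = generate(webdb, now + timedelta(hours=1))

# Seven days after the first run, its pages are due again in this model.
third = generate(webdb, now + timedelta(days=7))
```

In this model the two overlapping fetchlists never share a page, and a page from the first list only comes back once its 7 days are up, which is consistent with the "within 7 days" condition mentioned in the thread.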
Thanks. I had thought it was updating in place for some reason.

Is there a "best practices" document for a large scale full web crawl?

> On 7/6/05, Rod Taylor <[EMAIL PROTECTED]> wrote:
> > On Wed, 2005-07-06 at 09:36 -0400, Andy Liu wrote:
> > > You can emit multiple fetchlists using the -numFetchers option, copy
> > > each segment to a different machine to fetch, copy the segments back,
> > > and run updatedb on all the segments.
> >
> > Is it safe to create a new fetchlist before all fetchlists from the
> > previous generation have had updatedb run on them?
> >
> > That is, can we overlap cycles a bit?
> >
> > 1) fetchlist -numFetchers 2
> > 2) fetch segment1
> > 3) updatedb segment1
> > 4) fork process with one part continuing and another part returning to 1)
> > 5) fetch segment2
> > 6) updatedb segment2
> > 7) process exits
> >
> > While fetchlist and updatedb are running, the other segments would be
> > fetched and updated shortly after.
> >
> > Experimentally I believe I've determined that it is, provided you can get
> > to 7) within 7 days.
> >
> > > On 7/6/05, Karen Church <[EMAIL PROTECTED]> wrote:
> > > > Hi All,
> > > >
> > > > I was wondering if someone could point me in the right direction for
> > > > carrying out a distributed crawl. Basically I want to split a crawl
> > > > over a few machines. Is there a way of just 'fetching' the pages using
> > > > multiple machines and then merging the results onto a single machine?
> > > > Can I then run the Nutch indexing process over that single machine?
> > > >
> > > > Thanks
> > > > Karen
> >
> > --
> > Rod Taylor <[EMAIL PROTECTED]>

--
Rod Taylor <[EMAIL PROTECTED]>

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
