[Nutch-general] Re: Distributed Crawl

Rod Taylor Wed, 06 Jul 2005 10:37:42 -0700

On Wed, 2005-07-06 at 13:02 -0400, Andy Liu wrote:
> Sure, you can overlap like you described in your post.  For each page
> emitted during fetchlist generation, the page's nextFetch time is set
> forward 7 days within the webdb.  This is specifically done so that
> the same page won't be emitted during subsequent runs of generate, within
> 7 days.
> 
> Keep in mind that the webdb is rebuilt from scratch every time you
> perform generate or updatedb operations.  So for example, running
> updatedb on 2 segments separately would take much longer than running
> updatedb on both segments at the same time.


Thanks. I had thought it was updating in place for some reason.
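
To make the nextFetch behaviour concrete, here is a toy model in Python. It is
only a sketch of the mechanism Andy describes, not Nutch code: the dict-based
"webdb", the generate() helper, its top_n parameter and the example.com URLs
are all made up for illustration. It shows why a second generate run started
before the first segment is updated does not hand out the same pages, and why
everything becomes due again once the 7 days pass.

```python
from datetime import datetime, timedelta

REFETCH_DELAY = timedelta(days=7)  # the 7-day push described above


def generate(webdb, now, top_n):
    """Toy stand-in for fetchlist generation (hypothetical helper).

    Picks up to top_n pages whose next_fetch time has arrived and pushes
    their next_fetch forward so later runs skip them.
    """
    due = sorted(url for url, next_fetch in webdb.items() if next_fetch <= now)
    fetchlist = due[:top_n]
    for url in fetchlist:
        webdb[url] = now + REFETCH_DELAY
    return fetchlist


now = datetime(2005, 7, 6)
# Hypothetical webdb: url -> next_fetch time, all pages due immediately.
webdb = {f"http://example.com/{i}": now for i in range(4)}

first = generate(webdb, now, top_n=2)
# Overlapping cycle: generate again before updatedb has run on `first`.
second = generate(webdb, now + timedelta(hours=1), top_n=2)
# A day later nothing is due, since every page was pushed forward.
third = generate(webdb, now + timedelta(days=1), top_n=2)
# After the 7-day window, pages are emitted again whether or not
# updatedb ever ran on the earlier segments.
late = generate(webdb, now + timedelta(days=8), top_n=4)

print(first, second, third, late, sep="\n")
```

In this model the overlap is safe only while each segment gets through
updatedb before its pages fall due again, which matches the "within 7 days"
condition in the quoted message below.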

Is there a "best practices" document for a large scale full web crawl?

> On 7/6/05, Rod Taylor <[EMAIL PROTECTED]> wrote:
> > On Wed, 2005-07-06 at 09:36 -0400, Andy Liu wrote:
> > > You can emit multiple fetchlists using the -numFetchers option, copy
> > > each segment to a different machine to fetch, copy the segments back,
> > > and run updatedb on all the segments.
> > 
> > Is it safe to create a new fetchlist before all fetchlists from the
> > previous generation have had updatedb run on them?
> > 
> > That is, can we overlap cycles a bit?
> > 
> > 1) fetchlist -numFetchers 2
> > 2) fetch segment1
> > 3) updatedb segment1
> > 4) fork process with one part continuing and another part returning to
> > 1)
> > 5) fetch segment2
> > 6) updatedb segment2
> > 7) process exits
> > 
> > While fetchlist and updatedb are running, the other segments would be
> > fetched and updated shortly after.
> > 
> > Experimentally, I believe I've determined that it is, provided you can
> > get to 7) within 7 days.
> > 
> > > On 7/6/05, Karen Church <[EMAIL PROTECTED]> wrote:
> > > > Hi All,
> > > >
> > > > I was wondering if someone could point me in the right direction for 
> > > > > carrying out a distributed crawl.  Basically I want to split a crawl
> > > > over a few machines. Is there a way of just 'fetching' the pages using 
> > > > multiple machines and then merging the results onto a single machine? 
> > > > Can I then run the Nutch indexing process over that single machine?
> > > >
> > > > Thanks
> > > > Karen
> > > >
> > > >
> > >
> > --
> > Rod Taylor <[EMAIL PROTECTED]>
> > 
> >
> 
-- 
Rod Taylor <[EMAIL PROTECTED]>


