Daniel,

Nutch doesn't do anything by itself; you have to initiate the refetch
process by running something like:

bin/nutch generate -refetchonly db segments -numFetchers 30 -topN 30000000


Something like that would refetch the top 30 million documents and give
you roughly 30 segments with about 1 million URLs in each.

You could then move these segments to your spider boxes (or NFS-mount
them there) and fetch them concurrently (one segment per box or something):

machine 1: bin/nutch fetch segments/200505012345-0
machine 2: bin/nutch fetch segments/200505012345-1

.... so on and so forth....
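
If the segments sit on a shared NFS mount you could even kick all the
fetches off from one box with a bit of shell. Rough sketch only -- the
spiderN hostnames and the /opt/nutch path are made up, adjust to taste:

  # rough sketch: spiderN hostnames and /opt/nutch are assumptions
  i=0
  for seg in segments/*; do
    i=$((i+1))
    ssh spider$i "cd /opt/nutch && bin/nutch fetch $seg" &
  done
  wait   # block until every remote fetch has finished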

Hopefully, with the new stuff Doug is working on, "fetch/spider" boxes
will be able to apply a rule against the DB for constant fetching/updates
without this much manual intervention.
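
Until then you can fake it with a cron job that just re-runs the cycle.
Another rough sketch -- the paths, the -topN number, and tacking updatedb
on after each fetch are my assumptions about a typical setup, not
something Nutch does for you:

  # crude recrawl loop for cron -- assumes segments/ holds only this run's output
  bin/nutch generate -refetchonly db segments -numFetchers 30 -topN 30000000
  for seg in segments/*; do
    bin/nutch fetch $seg             # or farm these out to the spider boxes as above
    bin/nutch updatedb db $seg       # fold newly discovered URLs back into the db
  done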

-byron

-----Original Message-----
From: "Daniel D." <[EMAIL PROTECTED]>
To: [email protected]
Date: Sun, 5 Jun 2005 17:32:53 -0400
Subject: Recrawl, New URLS and Nutch on multiple machines !

> Hi,
> 
> I wanted to try out Nutch and understand how to set up whole-Internet
> crawling. It was very easy to follow the tutorial for Whole-web
> Crawling, but I have some questions:
> 
> 1.    I have read that by default Nutch will recrawl URLs every 30 days.
> I said "Nutch", but I really don't know what triggers the recrawl. The
> fetcher stops as soon as all fetcher threads are done. The tutorial
> advises performing the different steps of "Whole-web Crawling"
> yourself: generate, inject, fetch, index.
> 
>            What command (component) will create a thread that will
> remain alive and trigger the recrawl?
> 
> 2.    How are newly discovered URLs crawled?
> 
> 3.    How can I run the Nutch crawler on multiple machines?
> 
> 
> Will appreciate your help!!
> 
> Thanks,
> Daniel
> 


