Daniel,

Nutch doesn't do anything by itself; you have to initiate the refetch
process yourself by running something like:

bin/nutch generate -refetchonly db segments -numFetchers 30 -topN 30000000


Something like that would generate the refetch of the top 30 million
documents and give you roughly 30 segments of 1 million (+/-) URLs each.
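
As a rough check of that arithmetic, here is a minimal Python sketch. The
function name is made up for illustration and is not part of Nutch, and
the real split is only approximate (hence the +/-):

```python
def urls_per_segment(top_n, num_fetchers):
    """Approximate number of URLs per segment if topN is split evenly."""
    return top_n // num_fetchers

# Values taken from the generate command above.
top_n = 30_000_000       # -topN
num_fetchers = 30        # -numFetchers

size = urls_per_segment(top_n, num_fetchers)
print(f"{num_fetchers} segments of about {size:,} URLs each")
```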

You could then move these segments to your spider boxes (or NFS-mount
them there) and fetch them concurrently (one segment per box or something).

machine 1: bin/nutch fetch segments/200505012345-0
machine 2: bin/nutch fetch segments/200505012345-1

... and so on and so forth.
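
If you have more segments than boxes, one way to plan this is to hand the
segments out round-robin. The Python sketch below only prints the fetch
commands; it does not run them. The hostnames and the function name are
hypothetical, and the segment names just follow the pattern above (the
real names come from whatever the generate step produced):

```python
from itertools import cycle

def plan_fetches(segments, machines):
    """Pair each segment with a machine round-robin.

    Returns a list of (machine, command) tuples; nothing is executed.
    """
    return [(machine, f"bin/nutch fetch {segment}")
            for segment, machine in zip(segments, cycle(machines))]

# Hypothetical inputs for illustration only.
segments = [f"segments/200505012345-{i}" for i in range(4)]
machines = ["machine1", "machine2"]

for machine, command in plan_fetches(segments, machines):
    print(f"{machine}: {command}")
```

With four segments and two boxes, each box ends up with two fetch commands
to run one after the other.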

Hopefully, with the new stuff Doug is working on, "fetch/spider" boxes
will be able to apply a rule against the DB for constant
fetching/updates without this much manual intervention.

-byron

-----Original Message-----
From: "Daniel D." <[EMAIL PROTECTED]>
To: [email protected]
Date: Sun, 5 Jun 2005 17:32:53 -0400
Subject: Recrawl, New URLS and Nutch on multiple machines !

> Hi,
> 
> I wanted to try out Nutch and understand how to set up crawling of
> the whole Internet. It was very easy to follow the tutorial for
> Whole-web Crawling, but I have some questions:
> 
> 1.    I have read that by default Nutch will recrawl URLs every 30 days.
> I said "Nutch", but I really don't know what triggers the recrawl.
> The fetcher stops as soon as all fetcher threads are done. The
> tutorial advises performing separate steps for "Whole-web Crawling":
> generate, inject, fetch, index.
> 
>            Which command (component) creates a thread that stays
> alive and triggers the recrawl?
> 
> 2.    How are newly discovered URLs crawled?
> 
> 3.    How can I run Nutch crawler on multiple machines?  
> 
> 
> I would appreciate your help!
> 
> Thanks,
> Daniel
> 





_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
