Daniel, Nutch doesn't do anything by itself; you have to initiate the refetch process by running something like:
bin/nutch generate -refetchonly db segments -numFetchers 30 -topN 30000000

Something like that would do your refetch of the top 30 million documents and give you roughly 30 segments of 1 million +/- urls each. You could then move these segments (or nfs mount them) onto your spider boxes and fetch them concurrently (one segment per box, or something like that):

machine 1: bin/nutch fetch segments/200505012345-0
machine 2: bin/nutch fetch segments/200505012345-1
... and so on and so forth.

Hopefully with the new stuff Doug is working on, "fetch/spider" boxes will be able to apply a rule against the DB for constant fetching/updates without this much manual intervention.

(I've sketched a full recrawl cycle at the bottom of this mail, below your questions.)

-byron

-----Original Message-----
From: "Daniel D." <[EMAIL PROTECTED]>
To: [email protected]
Date: Sun, 5 Jun 2005 17:32:53 -0400
Subject: Recrawl, New URLS and Nutch on multiple machines !

> Hi,
>
> I wanted to try out Nutch and understand how to set up whole-Internet
> crawling. It was very easy to follow the tutorial for Whole-web
> Crawling, but I have some questions:
>
> 1. I have read that by default Nutch will recrawl urls every 30 days.
> I said "Nutch", but I really don't know what triggers the recrawl.
> The fetcher stops as soon as all fetcher threads are done. The
> tutorial advises performing separate steps for Whole-web Crawling:
> generate, inject, fetch, index.
>
> What command (component) creates a thread that stays alive and
> triggers the recrawl?
>
> 2. How are newly discovered URLs crawled?
>
> 3. How can I run the Nutch crawler on multiple machines?
>
> Will appreciate your help!!
>
> Thanks,
> Daniel
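
P.S. To tie this back to your three questions: nothing in Nutch stays running to trigger the recrawl, so you typically kick off a cycle like the one below yourself (e.g. from cron). This is only a rough sketch pieced together from the commands above and the whole-web tutorial as I remember it -- the segment names and the db/segments paths are just placeholders, and the updatedb/index usage is my recollection, so check it against your Nutch version:

# master box: pick the pages due for refetch and split them into
# one fetchlist segment per spider box
bin/nutch generate -refetchonly db segments -numFetchers 30 -topN 30000000

# each spider box: fetch its own segment (segments copied over or nfs-mounted)
bin/nutch fetch segments/200505012345-0      # box 1
bin/nutch fetch segments/200505012345-1      # box 2
# ... one fetch per box/segment ...

# back on the master box, after all fetches finish: fold the results
# (including newly discovered links) back into the db, then index
bin/nutch updatedb db segments/200505012345-0
bin/nutch updatedb db segments/200505012345-1
bin/nutch index segments/200505012345-0
bin/nutch index segments/200505012345-1

The updatedb step is what pulls newly discovered urls into the db, so they become candidates for the next generate/fetch round.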
