Gents, I have one more question. I hope someone will respond!
The whole-web crawling tutorial advises the following command sequence: fetch, updatedb db, and then generate db segments -topN 1000. Use of the -topN parameter implies that updatedb db does some analysis on the fetched data. The analyze command (net.nutch.tools.LinkAnalysisTool) is not mentioned in the tutorial, but the DissectingTheNutchCrawler article (http://wiki.apache.org/nutch/DissectingTheNutchCrawler) does include it in the command sequence for whole-internet crawling. When should I run analyze, and when can I skip it? (A sketch of the cycle as I understand it follows below, after the quoted message.)

I'm trying to get a sense of how much storage (disk and RAM) the WebDB will require, and now I am also concerned about how many machine resources analyze will consume. Nobody has provided this information yet, so I would appreciate it if somebody shared their knowledge and thoughts here. I'm looking for something like: for 1,000,000 documents the WebDB will take approximately XX GB, and running bin/nutch updatedb on 1,000,000 documents will use up to XX MB of RAM.

Thanks,
Daniel

On 6/13/05, Daniel D. <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> As I understand it, the Nutch crawler employs crawl-and-stop-with-threshold,
> controlled by the -topN parameter. Please correct me if I'm wrong. This also
> means that some sites will be crawled to a different depth than others.
>
> Is there a way to control the crawling depth per domain and the number of URLs
> per domain, as well as the total number of domains crawled (in this case it is
> -topN)?
>
> Thanks,
>
> Daniel
>
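For reference, here is the fetch/update cycle I am comparing against, written out as shell commands. This is only a sketch from my reading of the tutorial and the DissectingTheNutchCrawler page; the segment-selection line and the iteration count passed to analyze are my assumptions, not something the tutorial states:

    # one whole-web fetch/update cycle, as I understand it
    bin/nutch generate db segments -topN 1000   # select the top-scoring 1000 URLs to fetch
    s=`ls -d segments/2* | tail -1`             # pick the newest segment directory (assumed layout)
    bin/nutch fetch $s                          # fetch the selected pages
    bin/nutch updatedb db $s                    # fold fetched pages and links back into the WebDB
    # the step in question, taken from DissectingTheNutchCrawler rather than the tutorial:
    bin/nutch analyze db 5                      # LinkAnalysisTool; 5 iterations is my guess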