Andrzej,

Thanks a lot for the answers. Sorry for being persistent in my posts... I was going on vacation for 3 weeks and needed to finish my work before leaving. I appreciate your help.

Regards,
Daniel
On 6/16/05, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
>
> Daniel D. wrote:
> > Dear Nutch Developers,
> >
> > I'm trying to get answers to my questions below but nobody is
> > responding. This is why I'm trying to post my questions again.
>
> Hi Daniel,
>
> Please see my answers below. Sometimes it takes patience, people have
> busy schedules...
>
> > ----------- Question # 1 ------------------------
> >
> > As I understand it, the Nutch crawler employs crawl-and-stop with a
> > threshold, used with the -topN parameter. Please correct me if I'm
> > wrong. This also means that some sites will end up with a different
> > depth than others.
>
> Yes and no - some pages that are located deep could have a high score
> (because of many inlinks), so they would be put on the list for
> fetching, even though pages that are closer to the root URL may not
> have been fetched yet, or indeed will never be fetched because they
> score too low.
>
> > Is there a way to control the crawling depth per domain and the
> > number of URLs per domain, as well as the total number of domains
> > crawled (in this case it's -topN)?
>
> -topN controls fetching by score. What you want is to control fetching
> by depth. Currently the FetchListTool doesn't implement this, but it
> would be trivial to add.
>
> > ----------- Question # 2 ------------------------
> >
> > The whole-web crawling tutorial advises using the following command
> > sequence:
> >
> >   fetch
> >   updatedb db
> >
> > and then: generate db segments -topN 1000
> >
> > Use of the -topN parameter implies that updatedb db does some
> > analysis on the fetched data. The analyze command
> > (net.nutch.tools.LinkAnalysisTool) is not mentioned in the tutorial.
> > The DissectingTheNutchCrawler article
> > (http://wiki.apache.org/nutch/DissectingTheNutchCrawler) includes
> > this command in the sequence of commands for whole-internet crawling.
> >
> > When should I use the analyze command, and when can I skip it?
>
> With the default settings you don't need to use this command. Nutch
> approximates full web-graph scoring with a score based on the number
> of inlinks. Additionally, this command is known to be slightly
> broken...
>
> > I'm trying to get a sense of how much memory (hard drive and RAM)
> > the WebDB will require, and now I'm also concerned about how many
> > machine resources analyze will consume. Nobody has provided this
> > information yet. I would appreciate it if somebody would share their
> > knowledge and thoughts here.
>
> Don't use analyze - it will consume any disk space that you throw at
> it ;-)
>
> The WebDB normally consumes ca. 2 kB per page. This may temporarily
> increase to 3x this number during DB updating.
>
> > I'm looking for something like: for 1,000,000 documents the WebDB
> > will take approximately XX GB, and running bin/nutch updatedb on
> > 1,000,000 documents will use up to XX MB of RAM.
>
> The last figure depends on the settings of your JVM, i.e. what heap
> size you set for it. Updatedb should not consume much memory in any
> case.
>
> > ----------- Question # 3 ------------------------
> >
> > After the initial inject and the subsequent fetch and updatedb
> > command(s), can I use inject to add more URLs to the WebDB?
>
> Yes, of course.
>
> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  || |   Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
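On Question 1: since the FetchListTool doesn't yet limit depth or URLs per domain, one workaround (a sketch of my own, not a built-in Nutch feature; the file names and the limit of 10 are placeholders) is to cap the seed list per host before injecting it:

  # keep at most 10 URLs per host from a larger URL list
  awk -F/ '{ host = $3; if (count[host]++ < 10) print }' all_urls.txt > seeds.txt

This only bounds the injected seeds, not URLs discovered later during fetching, so it is a partial substitute for real per-domain limits in the fetchlist generator.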
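On Questions 2 and 3, the whole-web cycle the tutorial describes looks roughly like this (a minimal sketch based on the 0.7-era commands quoted above; the seed file and the -topN value are placeholders, and exact flags may differ between Nutch versions, so run bin/nutch with no arguments to check usage):

  # one-time setup: create an empty WebDB
  bin/nutch admin db -create

  # inject seed URLs; per the answer to Question 3, more URLs can be
  # injected the same way at any later point
  bin/nutch inject db -urlfile seeds.txt

  # then repeat the generate/fetch/updatedb cycle:
  bin/nutch generate db segments -topN 1000   # fetchlist of the 1000 top-scoring URLs
  s=`ls -d segments/2* | tail -1`             # pick the newest segment
  bin/nutch fetch $s                          # fetch the listed pages
  bin/nutch updatedb db $s                    # fold the results back into the WebDB

Each pass extends the crawl one hop, and -topN bounds each pass by score rather than by depth, which is exactly the behaviour discussed under Question 1.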
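Plugging Daniel's 1,000,000-document example into the ~2 kB/page figure quoted above gives a rough answer to the sizing part of Question 2 (an estimate only; actual size varies with URL length and link density):

  1,000,000 pages x ~2 kB/page          =  ~2 GB of disk for the WebDB
  peak during updatedb:  ~3 x 2 GB      =  ~6 GB of free disk space

RAM for updatedb is bounded by the JVM heap (the -Xmx setting) rather than by collection size, as noted above.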
