I see. I had the idea of depth being the length of the "chain of links" to follow from a site to other sites. For example: let's say I have cnn.com as a URL in my root fetchlist, and it has, e.g., a link to www.nbc.com, and www.nbc.com has a link to www.news.com.
So if I had chosen depth 3, that means I would have crawled www.news.com as well (cnn --> nbc --> news). I understand now that I was mistaken?

So the only way to tell the crawler to keep "digging" inside a URL is via the nutch-site.xml file, am I right?

thanks,
Eyal.

On 8/30/07, Gal Nitzan <[EMAIL PROTECTED]> wrote:
>
> Hey Eyal,
>
> Actually, in the mode you call "command mode" there is no depth value.
>
> To be more specific, the depth value is not "folder depth"; it means the
> number of times the crawler would run from the basic seeds you entered
> to it. So for example, if you put into your seeds 1 URL to www.sample.com
> and in the crawl mode you set the "depth" to 3, then the crawler would run
> 3 times, where each time the URLs found during the previous crawl would be
> crawled. In the last stages of the crawl, after the crawling stage is
> done, the data would be indexed.
>
> So, in the "command mode", to achieve this you would need to write a small
> bash script which would copy that behavior, which is:
>
> For the number of depth
>     NewSegment = Nutch generate  # generate the list of URLs to fetch
>     Nutch fetch NewSegment       # fetch list of URLs
>     Nutch updatedb NewSegment    # update the status of crawled links and
>                                  # add newly found links
> Next.
>
> HTH,
>
> Gal Nitzan.
>
> > -----Original Message-----
> > From: eyal edri [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, August 30, 2007 10:49 AM
> > To: email@example.com
> > Subject: depth arg in non crawl mode (fetch)
> >
> > Hello,
> >
> > I'm testing nutch 0.9 in the "Whole-Web" approach where I use a set of
> > commands to run the engine instead of just running "crawl",
> > i.e. nutch inject
> >      nutch generate
> >      nutch fetch
> >      nutch updatedb.. and so on.
> >
> > My question is, where can I define the depth arg (same one that appears
> > in the crawl mode), in the broken ('whole web') mode?
> >
> > thanks,
> >
> > --
> > Eyal Edri
>

--
Eyal Edri
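
For anyone finding this thread later: the loop Gal describes could be sketched roughly as below. This is only a sketch under assumptions, not a tested recipe. The function name crawl_rounds, the NUTCH variable and the example paths are made up for illustration, and the argument order of generate, fetch and updatedb follows the whole-web steps of the Nutch 0.8/0.9 tutorial as I remember them, so please check them against the usage output of bin/nutch in your install.

```shell
#!/bin/sh
# Sketch of the generate/fetch/updatedb loop from the thread.
# Assumptions: NUTCH, crawl_rounds and all paths are illustrative names,
# not part of Nutch itself.

NUTCH="${NUTCH:-bin/nutch}"   # path to the nutch launcher script

# crawl_rounds DEPTH CRAWLDB SEGMENTS_DIR
# Runs up to DEPTH rounds; each round fetches URLs found in earlier rounds.
crawl_rounds() {
    depth="$1"
    crawldb="$2"
    segments="$3"
    round=1
    prev=""

    while [ "$round" -le "$depth" ]; do
        echo "round $round of $depth"

        # generate a fetch list inside a new, timestamp-named segment
        "$NUTCH" generate "$crawldb" "$segments" || return 1

        # pick the newest segment (names sort by timestamp)
        segment=$(ls -d "$segments"/2* 2>/dev/null | tail -1)
        if [ -z "$segment" ] || [ "$segment" = "$prev" ]; then
            echo "no new segment generated; stopping"
            break
        fi
        prev="$segment"

        # fetch that list of URLs
        "$NUTCH" fetch "$segment" || return 1

        # update page status and add newly found links to the crawldb
        "$NUTCH" updatedb "$crawldb" "$segment" || return 1

        round=$((round + 1))
    done
}
```

With the seeds already injected once (e.g. bin/nutch inject crawl/crawldb urls), calling crawl_rounds 3 crawl/crawldb crawl/segments would then run three rounds. The indexing that the one-step crawl command does at the end is left out of this sketch.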