From what you described in your last post, it seems that you're generating all 10 segments right off the bat. If you look more closely at the CrawlTool code, each iteration not only generates the fetchlist, it also fetches the segment and updates the webdb with the results of the fetch.
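In outline, each iteration of that loop does something like the following. This is a minimal Java sketch, not the actual CrawlTool source: the tool classes are the ones named in this thread, but the exact main() arguments and the net.nutch.* package names used here may differ in your release (check the usage strings), and latestSegment is a hypothetical helper.

    // One crawl iteration, roughly as CrawlTool performs it:
    // generate a fetchlist, fetch it, then update the webdb.
    public class CrawlLoopSketch {
      public static void main(String[] args) throws Exception {
        String db = "db";              // webdb directory
        String segments = "segments";  // parent directory for segment dirs
        int depth = 5;                 // desired crawl depth

        for (int i = 0; i < depth; i++) {
          // 1. emit a fetchlist into a fresh segment directory
          net.nutch.tools.FetchListTool.main(new String[] { db, segments });

          // 2. fetch the pages listed in that segment
          String segment = latestSegment(segments);
          net.nutch.fetcher.Fetcher.main(new String[] { segment });

          // 3. fold the fetch results back into the webdb, so the next
          //    fetchlist can include newly discovered links
          net.nutch.tools.UpdateDatabaseTool.main(new String[] { db, segment });
        }
      }

      // Hypothetical helper: segment directories are named by timestamp,
      // so the lexicographically largest one is the newest.
      static String latestSegment(String segmentsDir) {
        java.io.File[] dirs = new java.io.File(segmentsDir).listFiles();
        java.util.Arrays.sort(dirs);
        return dirs[dirs.length - 1].getPath();
      }
    }

The point is that the fetch and updatedb steps happen between generate calls, which is why each later fetchlist contains links discovered by the earlier rounds, rather than all fetchlists being emitted up front.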
It sounds like CrawlTool may be sufficient for your needs. What's your rationale behind using multiple machines to fetch? Using multiple machines may not always speed up the crawling process, especially since you're adding extra time and complexity by generating multiple segments and copying them back and forth between machines. Another option is to try using NDFS, although I haven't used it in months and I don't know how well supported it is anymore. Or you can wait until Doug releases his MapReduce code.

Andy

On 7/6/05, Karen Church <[EMAIL PROTECTED]> wrote:
> Thanks for the reply, Andy. Based on what you've suggested I've been doing
> some tests. I've:
>
> 1. created a db
> 2. injected a set of urls
> 3. used the FetchListTool with the -numFetchers option to specify how many
>    fetchlists I want to emit.
>
> However, I'm a little confused about something. Let's say I want to crawl
> to a depth of 5. Looking at the CrawlTool code, the FetchListTool is just
> called X number of times, where X is the desired crawl depth. I've taken a
> similar approach, i.e. I've created a small application that calls the
> FetchListTool X number of times using a for loop, where the bound on the
> loop is the crawl depth I want, which in this case is 5.
>
> For test purposes, I'm setting numFetchers to 2 and the crawl depth to 5.
> The result is 10 segment directories, each with its own fetchlist. For
> example:
>
> segment1-0
> segment2-0
> segment3-0
> segment4-0
> segment5-0
> segment1-1
> segment2-1
> segment3-1
> segment4-1
> segment5-1
>
> Am I on the right track with this approach?
>
> So, to carry out the fetch process on multiple machines, do I just copy
> these segments onto the various machines and fetch them using the
> 'Fetcher' class?
>
> Apologies if my questions seem strange - I'm just a little confused about
> whether or not I'm on the right track and how to proceed from here.
>
> Thanks,
> Karen
>
> ----- Original Message -----
> From: "Andy Liu" <[EMAIL PROTECTED]>
> To: <[email protected]>
> Sent: Wednesday, July 06, 2005 2:36 PM
> Subject: Re: Distributed Crawl
>
> You can emit multiple fetchlists using the -numFetchers option, copy
> each segment to a different machine to fetch, copy the segments back,
> and run updatedb on all the segments.
>
> Andy
>
> On 7/6/05, Karen Church <[EMAIL PROTECTED]> wrote:
> > Hi All,
> >
> > I was wondering if someone could point me in the right direction for
> > carrying out a distributed crawl. Basically I want to split a crawl over
> > a few machines. Is there a way of just 'fetching' the pages using
> > multiple machines and then merging the results onto a single machine?
> > Can I then run the Nutch indexing process over that single machine?
> >
> > Thanks
> > Karen
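For concreteness, here is what one round of the segment-shipping workflow described in this thread could look like, driven from the machine that holds the webdb. Again a minimal Java sketch under the same assumptions as above: the tool arguments are from memory of the 0.6-era usage strings, and the copying of segment directories between machines happens outside Nutch (scp, rsync, NFS, etc.) and is only marked in comments.

    // One round of a distributed crawl: emit N fetchlists, fetch each
    // segment on a different machine, copy results back, updatedb on all.
    public class DistributedRoundSketch {
      public static void main(String[] args) throws Exception {
        String db = "db";
        String segments = "segments";

        // 1. emit one fetchlist per fetcher machine (2 here, as in
        //    Karen's test), each in its own segment directory
        net.nutch.tools.FetchListTool.main(
            new String[] { db, segments, "-numFetchers", "2" });

        // 2. (out of band) copy each new segment directory to a
        //    different machine, run the fetcher there on its segment,
        //    e.g. net.nutch.fetcher.Fetcher.main(new String[] { seg }),
        //    then copy the fetched segments back under 'segments'

        // 3. run updatedb on every fetched segment so the next round's
        //    fetchlists include the newly discovered links
        //    (assumes 'segments' holds only this round's segment dirs)
        java.io.File[] segs = new java.io.File(segments).listFiles();
        for (int i = 0; i < segs.length; i++) {
          net.nutch.tools.UpdateDatabaseTool.main(
              new String[] { db, segs[i].getPath() });
        }
      }
    }

To crawl to depth D, repeat the whole round D times. The point Andy makes above is that each round's fetch and updatedb must complete before the next round's fetchlists are generated, rather than emitting all D x N fetchlists up front.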
