I had problems sending this, so I am resending it.

On Tue, Sep 23, 2008 at 6:33 PM, Guilherme Menezes < [EMAIL PROTECTED]> wrote:
> Hi everyone,
>
> Our research group is planning to set up a cluster sufficient to crawl
> around 1 billion individual Web pages (the estimated size of the
> Brazilian Web) for academic purposes, maybe using Nutch. We currently
> have 4 boxes (16 GB of RAM, 6 * 750 GB disks w/ 3 controllers, quad-core
> AMD Opteron processor), and we are considering buying more nodes. We
> have some questions right now that some of you may be able to help with:
>
> 1) Is it better to buy less powerful nodes in order to have more nodes
> and more parallelism, or is it better to have a smaller number of nodes
> equivalent to the ones we currently have? I guess having just 1 disk per
> controller would help. I also don't really know whether 16 GB of RAM is
> necessary. A quad-core may not be necessary either; maybe a dual-core
> would be sufficient. In your experience, where is it better to spend
> money: RAM, disk, processing power, more nodes, or everything?
>
> 2) How many nodes would be necessary to perform a Web crawl of 1 billion
> pages in about 1 month? Have you had any similar experiences? How many
> nodes did you use?
>
> Thank you for any help! We are very interested in understanding Nutch
> and collaborating in the future.
>
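As a rough sanity check on question 2, the target can be turned into a sustained fetch rate and a raw disk budget per page. The sketch below is back-of-the-envelope arithmetic only, not a Nutch benchmark: it assumes "about 1 month" means 30 days of continuous crawling, that load is spread evenly over the 4 existing boxes, and that all 6 * 750 GB per box is usable. The function names are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def required_pages_per_second(total_pages, days):
    """Sustained fetch rate needed to finish the crawl in `days` days."""
    return total_pages / (days * SECONDS_PER_DAY)


def raw_disk_tb(nodes, disks_per_node, gb_per_disk):
    """Raw (unreplicated) disk capacity of the cluster in decimal TB."""
    return nodes * disks_per_node * gb_per_disk / 1000


# Figures from the message above; 30 days is an assumption for "about 1 month".
total_pages = 1_000_000_000
days = 30
nodes = 4

cluster_rate = required_pages_per_second(total_pages, days)
per_node_rate = cluster_rate / nodes  # assumes perfectly even load

disk_tb = raw_disk_tb(nodes, disks_per_node=6, gb_per_disk=750)
# Raw bytes available per page, before any replication or index overhead.
bytes_per_page_budget = disk_tb * 1e12 / total_pages

print(f"cluster rate:    {cluster_rate:.1f} pages/s")
print(f"per-node rate:   {per_node_rate:.1f} pages/s")
print(f"raw disk:        {disk_tb:.1f} TB")
print(f"budget per page: {bytes_per_page_budget / 1000:.1f} KB")
```

With these inputs the crawl needs roughly 386 pages/s sustained across the cluster (about 96 pages/s per box), and the current 18 TB of raw disk leaves about 18 KB per page before replication. Changing `nodes` or `days` shows how the per-node rate scales when comparing many small nodes against a few large ones.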
