Hi everyone,

Our research group is planning to set up a cluster big enough to crawl
around 1 billion Web pages (our estimate of the Brazilian Web's size) for
academic purposes, probably using Nutch. We currently have 4 boxes (16 GB of
RAM, 6 * 750 GB disks on 3 controllers, quad-core AMD Opteron processor),
and we are considering buying more nodes. We have some questions
that some of you may be able to help with:

1) Is it better to buy less powerful nodes, so we get more nodes and more
parallelism, or to have a smaller number of nodes like the ones we
currently have? I suspect one disk per controller would help. I'm also not
sure whether 16 GB of RAM is really necessary, and maybe a quad-core is
overkill too; perhaps a dual-core would be sufficient. In your experience,
where is the money best spent: RAM, disk, CPU, more nodes, or a bit of
everything?

2) How many nodes would be necessary to crawl 1 billion pages in about
1 month? Have you done anything similar? How many nodes did you use?
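For reference, here is the back-of-envelope math we have been doing. The per-node fetch rate below is just a guess on our part, not a measured number, so corrections from anyone who has run Nutch at scale would be very welcome:

```python
# Rough sizing estimate for crawling 1 billion pages in one month.
# The per-node fetch rate is an assumption; in practice Nutch's
# politeness settings and available bandwidth will dominate.
import math

total_pages = 1_000_000_000
days = 30
required_rate = total_pages / (days * 24 * 3600)  # overall pages/second

# Assumed sustained fetch rate per node (pages/second) -- a guess.
per_node_rate = 50

nodes_needed = math.ceil(required_rate / per_node_rate)
print(f"Required overall rate: {required_rate:.0f} pages/s")
print(f"Nodes needed at {per_node_rate} pages/s each: {nodes_needed}")
```

With these assumptions the required overall rate comes out to roughly 386 pages/s, which is why we suspect the answer depends much more on sustainable per-node fetch rate than on raw CPU.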

Thank you for any help! We are very interested in understanding Nutch and
collaborating in the future.
