Thanks for your comments! Our research lab is mostly focused on NLP and IR, so we are aiming for good throughput and a reasonable storage capacity (~3-4 TB).
--
Sérgio Nunes

On Wed, Mar 5, 2008 at 6:51 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
> The right answer really depends on your workload and what your needs and
> goals are.
>
> You say that this is a research lab. If you are researching parallel
> algorithms, then I would recommend much higher parallelism.
>
> If you are working on problems where you want throughput, then the answer
> may be a bit different. In that case, the two major considerations are
> aggregate disk speed (proportional to number of drives/interfaces) and
> aggregate CPU speed. Much of my workload is disk limited, so I find having
> more machines each with a disk to saturate is a good idea. Having
> completely anemic CPUs is not very helpful, however.
>
> Assuming that you are only concerned with purchase cost, I would tend to
> recommend single CPU, dual core machines with a decently fast 64-bit CPU
> (Opteron or Xeon), each with 500GB drives. Depending on your luck, you may
> be able to get dual CPUs for 4 cores per box for a similar price. Getting
> two slightly smaller disks would probably give you better throughput for
> very slightly higher cost.
>
> If you are considering life-cycle costs then you may come up with slightly
> different configurations due to rack density. Blades don't generally have
> very large disks, so to get to 5TB you may require a lot of blades.
>
> Homogeneity is not a huge issue. I have a cluster with 4 pretty hot Xeon
> cores on some boxes and 2 lousy cores on other boxes, and the long tail
> phenomenon does not come up all that much because the file splits are fine
> enough that it all works out in the end.
>
> If this is for learning about parallel processing and if somebody else is
> paying your power bills, you should consider getting used machines or
> machines from a place like Dell outlet. Many of the machines that you get
> that way are considerably lower cost and would provide comparable
> disk/network bandwidth to new machines. eBay has bunches of Dell 1850s for
> sale, for instance.
>
>
>
> On 3/5/08 7:16 AM, "S. Nunes" <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> >
> > I'm trying to deploy a small Hadoop cluster for our research lab.
> > We are in the process of selecting the hardware for this cluster. We
> > are aiming at a 12 CPU, 5 TB cluster. This is obviously a very rough
> > estimate.
> >
> > I have a few questions and I would greatly appreciate your feedback.
> >
> > Which is better: a cluster based on many low-performance nodes, or a
> > cluster with fewer but high-performance nodes? For instance, should I
> > bet on a cluster with 4 nodes (1 CPU + 100 GB each) or on a cluster
> > with 2 nodes (2 CPU + 200 GB each)?
> >
> > What should be considered regarding node homogeneity? I understand
> > that a very unbalanced cluster would result in "long-tailed"
> > performance: slower nodes would penalize the overall performance.
> > However, how critical is that? Do you have performance numbers to
> > support our decision?
> >
> > Finally, do you recommend any specific hardware configuration for
> > starting a cluster (rack, blade, tower...)?
> >
> > Thanks in advance for your comments,
> >
> > --
> > Sérgio Nunes
> >
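
For anyone comparing configurations along the lines discussed in this thread, here is a rough back-of-the-envelope sketch in Python. It only does the arithmetic behind the "aggregate disk speed is proportional to the number of drives" point and the 12 CPU / 5 TB target. The names `NodeSpec`, `nodes_needed` and `summarize` are made up for illustration, the 70 MB/s per-drive figure is a placeholder assumption rather than a measurement, and the replication factor of 3 is simply the HDFS default, so treat the output as a starting point, not a recommendation.

```python
import math
from dataclasses import dataclass


@dataclass
class NodeSpec:
    """Hypothetical per-node hardware description (all numbers are placeholders)."""
    name: str
    cores: int
    drives: int
    drive_gb: int
    drive_mb_per_s: float  # ASSUMED sequential throughput per drive, not measured


def nodes_needed(spec: NodeSpec, target_raw_tb: float, target_cores: int) -> int:
    """Smallest node count that meets both the raw-capacity and the core target."""
    raw_gb_per_node = spec.drives * spec.drive_gb
    by_capacity = math.ceil(target_raw_tb * 1000 / raw_gb_per_node)
    by_cores = math.ceil(target_cores / spec.cores)
    return max(by_capacity, by_cores)


def summarize(spec: NodeSpec, target_raw_tb: float, target_cores: int,
              replication: int = 3) -> dict:
    """Cluster-level totals for one node type.

    replication=3 is the HDFS default; usable space is raw space divided by it.
    Aggregate disk throughput is modelled as simply additive across drives.
    """
    n = nodes_needed(spec, target_raw_tb, target_cores)
    raw_tb = n * spec.drives * spec.drive_gb / 1000
    return {
        "nodes": n,
        "cores": n * spec.cores,
        "raw_tb": raw_tb,
        "usable_tb": raw_tb / replication,
        "aggregate_disk_mb_per_s": n * spec.drives * spec.drive_mb_per_s,
    }


if __name__ == "__main__":
    # Two illustrative options from the thread: one 500GB drive per box,
    # or two slightly smaller drives per box.
    options = [
        NodeSpec("1 x 500GB, dual core", cores=2, drives=1, drive_gb=500, drive_mb_per_s=70.0),
        NodeSpec("2 x 250GB, dual core", cores=2, drives=2, drive_gb=250, drive_mb_per_s=70.0),
    ]
    for spec in options:
        print(spec.name, summarize(spec, target_raw_tb=5, target_cores=12))
```

Under these assumptions both options need the same number of boxes for 5 TB raw, but the two-drive box doubles the aggregate disk throughput, which is the trade-off Ted describes. Note that with replication the usable space is well below the raw figure.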
