I have a question that I feel I should ask on this thread. Let's say you want to build a cluster where you will be doing very little MapReduce, only storage and replication of data on HDFS. What would the hardware requirements be? No quad core? Less RAM?
Thanks -Ryan

On Thu, Oct 1, 2009 at 7:36 AM, tim robertson <timrobertson...@gmail.com> wrote:
> Disclaimer: I am pretty useless when it comes to hardware.
>
> I had a lot of issues with non-ECC memory when running hundreds of millions
> of inserts from MapReduce into HBase on a dev cluster. The errors were
> checksum errors, the consensus was that the memory was causing the issues,
> and all advice was to ensure ECC memory. The same cluster ran without (any
> apparent) error for simple counting operations on tab-delimited files.
>
> Cheers,
> Tim
>
> On Thu, Oct 1, 2009 at 11:49 AM, Steve Loughran <ste...@apache.org> wrote:
> > Kevin Sweeney wrote:
> >>
> >> I really appreciate everyone's input. We've been going back and forth on
> >> the server size issue here. There are a few reasons we shot for the $1k
> >> price: one, we wanted to be able to compare our datacenter costs vs. the
> >> cloud costs; another, we have spec'd out a fast Intel node with
> >> over-the-counter parts. We have a hard time justifying the dual-processor
> >> costs and really don't see the need for the big server extras like
> >> out-of-band management and redundancy. This is our proposed config, feel
> >> free to criticize :)
> >>
> >> Supermicro 512L-260 Chassis     $90
> >> Supermicro X8SIL                $160
> >> Heatsink                        $22
> >> Intel 3460 Xeon                 $350
> >> Samsung 7200 RPM SATA2        2x $85
> >> 2GB Non-ECC DIMM              4x $65
> >>
> >> This totals $1,052. Doesn't this seem like a reasonable setup? Isn't the
> >> purpose of a Hadoop cluster to build cheap, fast, replaceable nodes?
> >
> > Disclaimer 1: I work for a server vendor, so I may be biased. I will
> > attempt to avoid this by not pointing you at HP DL180 or SL170z servers.
> >
> > Disclaimer 2: I probably don't know what I'm talking about. As far as
> > Hadoop is concerned, I'm not sure anyone knows what "the right"
> > configuration is.
> >
> > * I'd consider ECC RAM.
> > On a large cluster, over time, errors occur: you either notice them or
> > propagate their effects.
> >
> > * Worry about power, cooling, and rack weight.
> >
> > * Include network costs and power budget. That's your own switch costs,
> > plus bandwidth in and out.
> >
> > * There are some good arguments in favour of fewer, higher-end machines
> > over many smaller ones: less network traffic, and often higher density.
> >
> > The cloud-hosted vs. owned question is an interesting one; I suspect the
> > spreadsheet there is pretty complex.
> >
> > * Estimate how much data you will want to store over time. On S3, those
> > costs ramp up fast; in your own rack you can maybe plan to stick in an
> > extra 2TB HDD a year from now (space, power, cooling, and weight
> > permitting), paying next year's prices for next year's capacity.
> >
> > * Virtual machine management costs are different from physical management
> > costs, especially if you don't invest time upfront on automating your
> > datacentre software provisioning (custom RPMs, PXE preboot, kickstart,
> > etc.). With VMMs you can almost hand-manage an image (naughty, but
> > possible), as long as you have only an image or two to push out. Even
> > then, I'd automate, but at a higher level, creating images on demand as
> > load/availability sees fit.
> >
> > -Steve
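Steve's point about S3 storage costs ramping up can be made concrete with a quick back-of-the-envelope script. All figures below are illustrative assumptions, not numbers from this thread: roughly 2009-era S3 storage pricing ($0.15/GB-month), a $200 2TB commodity drive, and the HDFS default replication factor of 3. It also ignores power, cooling, bandwidth, and admin costs, which the "complex spreadsheet" would have to include.

```python
import math

# Assumed prices (illustrative only, not from the thread):
S3_PER_GB_MONTH = 0.15   # rough 2009-era S3 storage price, USD per GB-month
HDD_COST = 200.0         # assumed price of one 2TB SATA drive, USD
HDD_CAPACITY_GB = 2000
HDFS_REPLICATION = 3     # HDFS default replication factor

def s3_cost(data_gb, months):
    """Cumulative S3 storage cost, assuming the data sits there the whole time."""
    return data_gb * S3_PER_GB_MONTH * months

def owned_disk_cost(data_gb):
    """One-time drive cost to hold data_gb with HDFS-style 3x replication."""
    drives = math.ceil(data_gb * HDFS_REPLICATION / HDD_CAPACITY_GB)
    return drives * HDD_COST

for data_gb in (500, 2000, 10000):
    print(f"{data_gb:>6} GB: S3 for 1 year ${s3_cost(data_gb, 12):>8.0f}   "
          f"owned drives (3x replicated) ${owned_disk_cost(data_gb):>8.0f}")
```

The crossover comes quickly for data that stays put: recurring per-GB-month charges overtake a one-time drive purchase within months at these assumed prices, which is the "costs ramp up fast" effect, while the owned-rack side defers capacity purchases to next year's cheaper drives.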