Some quick thoughts (from a relative newbie):

- If disk space is a problem, you could mount some EBS volumes.
- Building an AMI is really very easy: http://docs.amazonwebservices.com/AmazonEC2/dg/2006-06-26/creating-an-ami.html
- Why would you not use a Cloudera image (or RPM), though, and benefit from the work of others (testing, future upgrades, documentation, potential support, etc.)?
- If it were me, I would pick the EC2 instance size whose memory matches the machines you plan to purchase, so that you try out the tuning options you will use for real.
- EC2 seems pretty slow on disk IO. My Hadoop cluster runs at about 4x the EC2 speed and is made of Dell R200s with 2x500G SATA and 8GB of RAM.
- For testing, perhaps start by saturating HBase with heavy traffic that simulates your app (e.g. read/write) and see how gracefully both HBase and the clients handle it. Then, under normal load, start dropping servers from the HBase cluster (including ZK) to simulate failure, and bring them back up again.
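
To make that last suggestion a bit more concrete, here is a rough sketch of a mixed read/write load loop that counts errors and records latency while a server "goes away" and comes back. Everything in it is an assumption for illustration: `InMemoryStore` is a hypothetical stand-in for a real HBase client, and `run_load`, `fail_between`, `read_ratio` and `key_space` are made-up names. In a real test you would point the loop at the cluster and kill actual region servers or ZK nodes instead of simulating the outage.

```python
import random
import time


class InMemoryStore:
    """Hypothetical stand-in for an HBase client; swap in a real client here."""

    def __init__(self, fail_between=None):
        self.rows = {}
        self.calls = 0
        # (start, end) call-count window in which every call fails,
        # imitating a server that drops out and later comes back.
        self.fail_between = fail_between

    def _maybe_fail(self):
        self.calls += 1
        window = self.fail_between
        if window and window[0] <= self.calls < window[1]:
            raise ConnectionError("simulated server outage")

    def put(self, key, value):
        self._maybe_fail()
        self.rows[key] = value

    def get(self, key):
        self._maybe_fail()
        return self.rows.get(key)


def run_load(store, n_ops, read_ratio=0.5, key_space=1000, seed=0):
    """Issue a mixed read/write workload; return counts and latency percentiles."""
    rng = random.Random(seed)
    latencies = []
    errors = 0
    for _ in range(n_ops):
        key = "row-%d" % rng.randrange(key_space)
        start = time.perf_counter()
        try:
            if rng.random() < read_ratio:
                store.get(key)
            else:
                store.put(key, b"x" * 100)
        except ConnectionError:
            # Failed operations are counted, not timed.
            errors += 1
            continue
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    ok = len(latencies)
    return {
        "ok": ok,
        "errors": errors,
        "p50_s": latencies[ok // 2] if ok else None,
        "p99_s": latencies[min(ok - 1, int(ok * 0.99))] if ok else None,
    }


if __name__ == "__main__":
    # Calls 200..299 fail, as if one server were down for a while.
    print(run_load(InMemoryStore(fail_between=(200, 300)), n_ops=1000))
```

Running the same loop before, during and after a failure gives you comparable error counts and latency numbers, which is roughly what you want in order to see how gracefully the clients recover.
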
On Wed, Apr 21, 2010 at 7:58 AM, Sean <seanatpur...@hotmail.com> wrote:
> Hi folks,
>
> I am thinking of building a testing environment for an HBase cluster on
> EC2, and I plan to build such an environment for the following reasons:
>
> 1) To have reference throughput/read_latency numbers for different sizes
>    of HBase cluster.
> 2) To test various schema designs and their performance implications for
>    scan and M/R operations.
>
> -- After having results from 1 and 2, we can decide how to build the
>    actual physical cluster. The reason we don't want to build a physical
>    cluster in the first place is that I understand building a 4-node
>    cluster does not make much sense for a real load test (we do have a
>    rough estimate of how big our data size will be).
> -- At the same time, I hope to work out a good enough high-availability
>    solution while experimenting on 1 and 2.
>
> Having stated the motivation for this experiment, I'd like to ask
> several questions:
>
> a) After reading http://aws.amazon.com/ec2/instance-types/, I believe I
> should select "Standard Instances: Extra Large Instance" as my instance.
> Though it seems that I should pick the "High-Memory Instances" family,
> because we are talking about a memory-hungry application here,
> "High-Memory Instances" probably does not fit my testing environment --
> the disk space does not look like a good number. Note: after testing in
> this environment, I will need to use the benchmark numbers as a
> reference to build my actual cluster.
>
> b) I understand Cloudera provides an AMI, but can I build my own? If I
> can choose to do so, can someone give me a pointer? I have successfully
> built an HBase server on a 4-machine cluster; how much further effort
> (please give me an estimate if you would) do I need to put in to achieve
> this goal?
>
> c) Here is my testing environment:
> -- I build an HBase cluster for serving
> -- then I build several clients for issuing workload ops
> How can I learn the high-availability lessons around this? (I know most
> of the high-level ideas, but all the subtle issues come from
> implementation details, as we all know, especially for a distributed
> system.)
>
> Thanks for any suggestion!