I thought I might share my experience with the list in getting HBase running on a small, 4-node cluster. I ran into a lot of trouble getting started, some of it because of bugs and some of it specific to my use case. I hope these lessons will be valuable to new users.
First of all, let me compliment the amazing group of folks developing HBase. I'd also like to say that we owe a lot to the strategy Powerset has taken as a company to propel the development of their product, both leveraging and contributing to open source - what you guys are doing is nothing short of amazing!

My basic use case is to persist a large (and growing) sparse dataset and enable constant incremental re-computation. To test performance for this use case, it was important to load an initial test dataset - roughly 220 million rows and 6 columns (for now I'll speak of columns generically; I'll get to the strategy of column families later).

Some of my learnings:

- "Commodity hardware" is relative. When I first heard the term, I (and many others I know) took it to mean desktop-grade machines - the machines I'd purchased were dual-core 2+ GHz Dell desktops (bought on eBay for $350 apiece). You can definitely do certain tasks within the framework with these kinds of machines, but an ideal configuration is something much stronger - server-grade, quad-core, 8 GB of RAM, etc. HBase needs really good machine I/O, particularly if you are going to do a lot of writes. Machines with slow drives and controllers might get by if you have a ton of datanodes, but they're not advisable on smaller clusters.

- If you are going to run heavy map/reduce jobs, ideally you should always ensure that there is one processor available for the regionserver daemon and at least two for the tasktracker (or one, if you limit the map and reduce tasks to one each). The trouble with not doing so is that until 0.2, when there will be better load balancing on regionservers, it's always possible that a single regionserver is called on to shoulder the full load of all tasktrackers. If large write operations are happening, you can otherwise cause splits and/or compactions (expensive operations) to take too long, and your job will crawl to a near halt if you're lucky, or die completely. This means that if you're only using dual-core machines, I'd suggest that at least during heavy data-writing periods you run either a regionserver or a tasktracker on a machine, but probably not both.

- All machines should run a datanode - this helps the regionservers distribute the I/O load better. That way, when an expensive operation like a compaction starts, it's spread over more machines. Hadoop can also localize frequently used files, to some degree.

- Running bin/stop-hbase.sh can sometimes take a long time. Sometimes regionservers are waiting for a lease to expire. Occasionally there are dead processes (especially if you didn't take the earlier suggestions), so check the logs (.out) - see the PPS at the bottom of this mail for the commands I use - but often you just need to wait longer, and it's worth it.

- If you are writing from an MR job, it pays to find the right balance in the number of tasks. Too many tasks means too many splits, startups and commits. Too few, and your regionservers never get the benefit of a break (the time it takes to commit one task and initialize the next) - not to mention that smaller tasks mean less to repeat after a failure.

- Use the new 0.1.2 release candidate (#1). It has a number of fixes that help with issues on small clusters - I don't regard prior releases as usable for those of us being cheap about hardware.

- Don't be afraid to adjust which daemons you run on which machines.
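The per-daemon scripts make this easy to do by hand. Roughly like the following (a sketch, not gospel - it assumes the stock bin/ scripts shipped with Hadoop and HBase, and that HADOOP_HOME and HBASE_HOME point at your installs):

    # on a node you want to switch from running map/reduce to serving regions:
    $HADOOP_HOME/bin/hadoop-daemon.sh stop tasktracker
    $HBASE_HOME/bin/hbase-daemon.sh start regionserver

    # ...and back again once the heavy writing is done:
    $HBASE_HOME/bin/hbase-daemon.sh stop regionserver
    $HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker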
That's what I did for my first large (initial) load: I shut down all but a couple of tasktrackers and started up more regionservers, whereas in normal operation that ratio will probably be flipped.

- Watch the number of regions you have on any particular regionserver. I'm in the process of testing how far you can push this, but the big concern is an OOME - and unless you're running the latest release candidate, you're going to have big problems after an OOME.

Hope this is helpful. St^ack, please feel free to point out where I'm wrong. :-)

Danny

PS. Thanks again to St^ack. He went over and above the call of duty to help me, and it's given me a ton of confidence in this project.
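PPS. For the slow-shutdown case above, a couple of commands I lean on (nothing HBase-specific, just the stock JDK and Unix tools, so treat this as a sketch - your log file names may differ):

    # list the java daemons still alive on a node; a lingering or missing
    # HRegionServer/HMaster tells you which process to go look at
    jps

    # then watch what a slow regionserver is doing while it shuts down
    tail -f $HBASE_HOME/logs/*regionserver*.out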
