> That's funny. My understanding was, region servers were redundant inherently.
> If they are "semiredundant", there should be a root cause like some wrong
> settings or a bug.
>
> Could someone from HBase experts comment on this?

0.89 is a developer release and should be treated as such (i.e., expect bugs), and it is the version Matthew is using. A newer release candidate was posted here:
http://people.apache.org/~jdcryans/hbase-0.89.20100924-candidate-1/
This is the version we're running in production (and on a few other clusters) at StumbleUpon. We can kill -9 region servers as much as we want, and the cluster does recover.

Previously there was an issue with empty log files that's now fixed. Also, 0.89.20100830 introduced a change that's incompatible with the old way of recovering edits from a failed region server, so if leftovers from a previous split are present, they can prevent the master from splitting logs at all (which is another issue Matthew ran into in a separate thread).

J-D
