The problem I referred to was also being addressed in a separate thread, thanks to the contributors to the mailing list, and above all to J-D and Stack.
I recently upgraded to the 0.89.20100924 version and, after more than 24 hours, am very happy with the results. I think I must have missed the earlier announcement that this version was available (which list should I join to get such announcements?). Finally, I would like to thank them for their timely, relevant and ingenious efforts in response to the posts on the mailing list.

-Matthew

On Sep 27, 2010, at 11:40 AM, Jean-Daniel Cryans wrote:

>> That's funny. My understanding was, region servers were redundant
>> inherently. If they are "semiredundant", there should be a root cause
>> like some wrong settings or a bug.
>>
>> Could someone from HBase experts comment on this?
>
> 0.89 is a developer release, it should be treated as such (eg do
> expect bugs) and this is the version used by Matthew. A newer release
> candidate was posted here:
> http://people.apache.org/~jdcryans/hbase-0.89.20100924-candidate-1/
> and this is the version we're using in production (and on a few other
> clusters) at StumbleUpon. We can kill -9 region servers as much as we
> want, and the cluster does recover. Previously there was an issue with
> empty log files that's now fixed, also 0.89.2010830 introduced a
> changed that's incompatible with the old way of recovering edits from
> a failed region server, so if leftovers from a previous split are
> present this can prevent the master from splitting logs at all (which
> is another issue Matthew got in a separate thread).
>
> J-D
