Hey Michal,

There was an issue in the past where ROOT would not be properly reassigned if there was only a single server left:

https://issues.apache.org/jira/browse/HBASE-1908

But that was fixed back in 0.20.2. Can you post the master log?
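In the meantime, a quick check from the HBase shell should show whether the region's location ever made it into .META. (I'm going from memory on the 0.20 shell syntax, so treat this as a sketch and adjust as needed):

    scan '.META.', {STARTROW => '_old-home,,1267642988312', LIMIT => 1, COLUMNS => ['info:server', 'info:serverstartcode']}

If info:server is missing for that row, or still points at the server that was stopped, that would match the master never recording the open.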
JG

-----Original Message-----
From: Michał Podsiadłowski [mailto:podsiadlow...@gmail.com]
Sent: Friday, March 05, 2010 2:03 AM
To: hbase-user@hadoop.apache.org
Subject: HMaster fails to detect and retry failed region assign attempt.

Hi hbase-users!

Yesterday we ran quite an important test. On our production environment we introduced HBase as a web-cache layer (the first step in integrating it into our environment) and, in a controlled manner, we tried to break it ;). We started stopping various elements: a datanode, a region server, etc. Everything was working very nicely until my coworker started to simulate a disaster: he shut down 2/3 of our cluster, i.e. 2 of the 3 datanodes/region servers. It was still fine, though query times were significantly higher, which wasn't a surprise.

Then one of the region servers was started by a watchdog, and just after it stood up my friend invoked stop again. Regions had already started to be migrated to this node, and one of them was assigned by the HMaster and opened on the region server (there is a message in the log), but the confirmation never arrived at the HMaster. The region location was not saved to .META., and this state persisted until we restarted the HMaster and likewise all the region servers. We couldn't scan or get any row from that region, nor disable the table. It looked to us like the master gave up trying to assign the region, or assumed that the region was successfully assigned and opened.

I know the scenario we simulated was not a "normal" use case, but we still think the cluster should recover after some time, even from such a disaster. Just to clarify: all data from this table were replicated, so no blocks were missing.

Our HBase is 0.20.3 from Cloudera and our Hadoop is 0.20.1, also a clean Cloudera release. (Are any patches advised?)

Our cluster consists of 4 physical machines divided up with Xen: 3 machines are each split into a datanode + region server (4 GB RAM), a ZooKeeper node (512 MB RAM), and our other app; the 4th machine is split into the namenode (2 GB), the secondary namenode (2 GB), and the HMaster (1 GB).

The region causing the problem is _old-home,,1267642988312. You can find some logs here:

fwdn2 (the region server that was stopped while regions were being assigned): http://pastebin.com/uL48KCjd

For unknown reasons the log from the master is corrupted; after some point it appears as @@@@@@@.. in vim, though yesterday it was fine. What I saw there was something like this:

10/03/04 11:24:18 INFO master.RegionManager: Assigning region _old-home,,1267642988312 to fwdn1,60020,1267698243695

and then nothing more about this region.

Any help appreciated.

Thanks,
Michal