Hey Michal,

There was an issue in the past where ROOT would not be properly reassigned
if there was only a single server left.

https://issues.apache.org/jira/browse/HBASE-1908

But that was fixed back in 0.20.2.

Can you post the master log?

JG

-----Original Message-----
From: Michał Podsiadłowski [mailto:podsiadlow...@gmail.com] 
Sent: Friday, March 05, 2010 2:03 AM
To: hbase-user@hadoop.apache.org
Subject: Hmaster fails to detect and retry failed region assign attempt.

Hi hbase-users!

Yesterday we did quite an important test. On our production environment we
introduced HBase as a webcache layer (the first step in integrating it into
our env), and in a controlled manner we tried to break it ;). We started to
stop various elements, starting with a datanode, a regionserver, etc.
Everything was working very nicely until my coworker started to simulate a
disaster - he shut down 2/3 of our cluster, i.e. 2 of the 3
datanodes/regionservers. It was still fine, though query times were
significantly higher - which wasn't a surprise.

Then one of the regionservers was started by a watchdog, and just after it
came up my friend invoked stop. Regions had already started to be migrated to
this node, and one of them was assigned by the HMaster and opened on the
regionserver (there is a message in its log), but the confirmation never
arrived at the HMaster. The region location was not saved to .META., and this
state persisted until we restarted the HMaster and all the regionservers. We
couldn't scan or get any row from that region, nor disable the table. It
looked to us like the master gave up trying to assign the region, or it
assumed the region was successfully assigned and opened. I know the scenario
we simulated was not a "normal" use case, but we still think the cluster
should recover after some time, even from such a disaster. Just to clarify:
all data from this table was replicated, so no blocks were missing.

Our HBase is 0.20.3 from Cloudera and Hadoop is 0.20.1, also a clean Cloudera
release. (Are any patches advised?)
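
For context, the failing reads were plain gets against that table, roughly
like the sketch below (a minimal, from-memory sketch against the 0.20 client
API; the table name matches ours, the row key is just a placeholder):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class GetCheck {
  public static void main(String[] args) throws Exception {
    // The row key below is a placeholder for any row that falls into the
    // stuck region.
    HTable table = new HTable(new HBaseConfiguration(), "_old-home");
    Result r = table.get(new Get(Bytes.toBytes("some-row")));
    // For rows in the stuck region, gets like this one failed for us.
    System.out.println(r.isEmpty() ? "no result" : r.toString());
  }
}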

Our cluster consists of 4 physical machines divided up with Xen:
3 machines are each split into a datanode + regionserver (4 GB RAM), a
ZooKeeper node (512 MB RAM), and our other app;
the 4th machine is split into the namenode (2 GB), the secondary namenode
(2 GB), and the HMaster (1 GB).

The region causing the problem is _old-home,,1267642988312.
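
In case anyone wants to check the assignment state of that region in .META.,
something like the sketch below should work (again from memory, against the
0.20 client API, so treat it as untested). A region row with no info:server
cell is one whose location was never persisted - which matches what we saw.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class MetaCheck {
  public static void main(String[] args) throws Exception {
    // Scan the catalog table and print each region's info:server cell;
    // a missing cell means no location was persisted for that region.
    HTable meta = new HTable(new HBaseConfiguration(), ".META.");
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("info"));
    ResultScanner scanner = meta.getScanner(scan);
    try {
      for (Result r : scanner) {
        byte[] server = r.getValue(Bytes.toBytes("info"),
            Bytes.toBytes("server"));
        System.out.println(Bytes.toString(r.getRow()) + " -> "
            + (server == null ? "no info:server" : Bytes.toString(server)));
      }
    } finally {
      scanner.close();
    }
  }
}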

You can find some logs here:
fwdn2 - the regionserver that was stopped while regions were being assigned -
http://pastebin.com/uL48KCjd

For unknown reasons the log from the master is corrupted - after some point it
appears as @@@@@@@.. in vim - though yesterday it was fine.
What I saw there was something like this:

10/03/04 11:24:18 INFO master.RegionManager: Assigning region
_old-home,,1267642988312 to fwdn1,60020,1267698243695
and then nothing more about this region.

Any help appreciated.

Thanks
Michal
