Hi,
I have a small cluster (6 nodes: 1 master and 5 region server/data nodes).
Each node has plenty of memory and disk: 16 GB of heap dedicated to the
RegionServer and 4 TB of disk per node for HDFS.
I have a table with about 1 million rows in HBase - that's all.  Currently it
is split across 50 regions.
I was monitoring this with the HBase web GUI and noticed that a lot of the
heap was in use (14 GB).  I was running an MR job and getting an error on the
console that launched the job:
Error: GC overhead limit exceeded

First question: is this going to hose the whole system?  I didn't see the error
in any of the HBase logs, so I assume it was purely a client issue.
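
In case it's relevant, the job scans that table.  The sketch below is roughly
the shape of it, assuming the standard TableMapReduceUtil setup; the table
name, class names, and scan settings are placeholders, not my actual code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ScanJobSketch {

  // Stand-in for my real mapper; it just receives each row of the table.
  static class RowMapper extends TableMapper<ImmutableBytesWritable, Result> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result row, Context context) {
      // per-row work elided
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "scan-job-sketch");
    job.setJarByClass(ScanJobSketch.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // fetch rows in batches per RPC rather than one at a time
    scan.setCacheBlocks(false);  // keep a full-table scan from churning the block cache

    TableMapReduceUtil.initTableMapperJob(
        "mytable",                     // placeholder table name
        scan,
        RowMapper.class,
        ImmutableBytesWritable.class,  // mapper output key class
        Result.class,                  // mapper output value class
        job);

    job.setNumReduceTasks(0);                           // map-only
    job.setOutputFormatClass(NullOutputFormat.class);   // this sketch writes nothing out
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

My understanding is that setCacheBlocks(false) is what's recommended for full
scans from MR so the job doesn't churn the region server block cache; if
that's wrong, please correct me.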

So, naively thinking that maybe the GC had moved everything to permgen and just
wasn't cleaning it up, I thought I would do a rolling restart of my region
servers and see if that cleared everything up.  The first server I killed
happened to be the one hosting the .META. table, and subsequently the web GUI
failed.  Looking at the errors, it seems that the web GUI essentially caches
the address of the .META. table and blindly tries connecting to it on every
request.  I suppose I could restart the master, but this does not seem like
desirable behavior.  Shouldn't the cache be refreshed on error?  And since
there is no real code for the GUI, just a JSP page, doesn't this mean the same
behavior could show up in other applications that use HMaster?
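
To be concrete about what I mean by "refreshed on error", here is a toy sketch
of the behavior I was expecting.  MetaLookup and RpcSender are made-up
placeholder interfaces, not HBase API; they stand in for whatever the GUI
actually uses to resolve and call .META.:

import java.io.IOException;

public class MetaLocationCacheSketch {

  interface MetaLookup {                 // whatever re-resolves the current .META. location
    String lookupMetaServer() throws IOException;
  }

  interface RpcSender {                  // whatever actually talks to a region server
    String call(String serverAddress, String request) throws IOException;
  }

  private final MetaLookup lookup;
  private final RpcSender rpc;
  private volatile String cachedMetaServer;  // the address that appears to be cached forever

  MetaLocationCacheSketch(MetaLookup lookup, RpcSender rpc) {
    this.lookup = lookup;
    this.rpc = rpc;
  }

  String request(String req) throws IOException {
    if (cachedMetaServer == null) {
      cachedMetaServer = lookup.lookupMetaServer();
    }
    try {
      return rpc.call(cachedMetaServer, req);
    } catch (IOException e) {
      // The part I was expecting: drop the stale address, re-resolve, retry once.
      cachedMetaServer = null;
      cachedMetaServer = lookup.lookupMetaServer();
      return rpc.call(cachedMetaServer, req);
    }
  }
}

Instead, the GUI appears to skip the re-resolve step and just keeps retrying
the dead address until the master is restarted.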

Corrections welcome
Dave
