Hi,

I have a small cluster (6 nodes: 1 master and 5 region server/data nodes). Each node has plenty of memory and disk: 16 GB of heap dedicated to the RegionServers and 4 TB of disk per node for HDFS. I have a table with about 1 million rows in HBase - that's all - and it is currently split across 50 regions. While monitoring with the HBase web GUI I noticed that a lot of the heap was in use (14 GB). I was running a MapReduce job against the table, and the console that launched the job printed:

Error: GC overhead limit exceeded
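In case it helps pin down where the memory is going, here is a minimal sketch of how I'm thinking the scan side of such a job could be set up to keep client-side buffering modest. This assumes the job uses TableMapReduceUtil; the table name "mytable" and the row-counting mapper are just placeholders, not my actual job:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class ScanJob {
      // Placeholder mapper that just counts rows; the real job's logic would go here.
      static class RowCounterMapper extends TableMapper<ImmutableBytesWritable, Result> {
        @Override
        protected void map(ImmutableBytesWritable key, Result value, Context context)
            throws java.io.IOException, InterruptedException {
          context.getCounter("scan", "rows").increment(1);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "scan-mytable");
        job.setJarByClass(ScanJob.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // moderate batch size so the client isn't buffering huge chunks
        scan.setCacheBlocks(false);  // don't pollute the RegionServer block cache with a full scan

        TableMapReduceUtil.initTableMapperJob(
            "mytable",               // hypothetical table name
            scan,
            RowCounterMapper.class,
            ImmutableBytesWritable.class,
            Result.class,
            job);
        job.setNumReduceTasks(0);
        job.setOutputFormatClass(NullOutputFormat.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

I realize I could also just give the launching JVM and the map tasks a bigger heap (e.g. via mapred.child.java.opts), but I'd rather understand why 1 million rows is eating this much memory first.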
First question: is this going to hose the whole system? I didn't see the error in any of the HBase logs, so I assume it was purely a client-side issue.

Naively thinking that maybe the GC had promoted everything to the old generation and just wasn't cleaning it up, I decided to do a rolling restart of my region servers to see if that cleared things up. The first server I killed happened to be the one hosting the .META. table, and after that the web GUI failed. Looking at the errors, it seems the GUI caches the address of the meta table and blindly tries connecting to it on every request. I suppose I could restart the master, but this doesn't seem like desirable behavior. Shouldn't the cache be refreshed on error? And since there is no real code behind the GUI, just a JSP page, doesn't this mean the same behavior could show up in other applications that use HMaster?

Corrections welcome.

Dave
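For what it's worth, on the application side I was imagining something like the sketch below: catch the connection failure, drop the cached region locations, and retry so the client re-looks-up where .META. lives. This is only a guess at the right approach - the table name, row key, retry counts, and especially the HConnection.clearRegionCache() call are assumptions on my part and may not exist or behave this way in every version:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HConnection;
    import org.apache.hadoop.hbase.client.HConnectionManager;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class MetaCacheRetry {
      // Retry a Get, clearing the client's cached region locations between attempts
      // so a moved .META. (or user region) gets re-resolved instead of blindly reused.
      public static Result getWithRefresh(Configuration conf, String table, byte[] row)
          throws IOException, InterruptedException {
        IOException last = null;
        for (int attempt = 0; attempt < 3; attempt++) {
          HTable htable = new HTable(conf, table);
          try {
            return htable.get(new Get(row));
          } catch (IOException e) {
            last = e;
            // Assumed API: drop stale cached locations so the next attempt
            // asks ZooKeeper/-ROOT- again for the current .META. location.
            HConnection conn = HConnectionManager.getConnection(conf);
            conn.clearRegionCache();
            Thread.sleep(1000L * (attempt + 1));  // simple backoff
          } finally {
            htable.close();
          }
        }
        throw last;
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Result r = getWithRefresh(conf, "mytable", Bytes.toBytes("row-1"));
        System.out.println(r);
      }
    }

Is that roughly what well-behaved clients are expected to do, and should the master's GUI be doing the same?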