Dave,

Can you pastebin the exact error that was returned by the MR job? That looks like it's client-side (from HBase's point of view).
WRT the .META. and the master, the web page does do a request on every hit, so if the region is unavailable then you can't see it. Looks like you kill -9'ed the region server? If so, it takes a minute to detect the region server failure and then split the write-ahead logs, so if .META. was on that machine, it will take that much time to have a working web page.

Instead of kill -9, simply go on the node and run:

./bin/hbase-daemon.sh stop regionserver

J-D

On Wed, Mar 31, 2010 at 5:51 PM, Buttler, David <buttl...@llnl.gov> wrote:
> Hi,
> I have a small cluster (6 nodes, 1 master and 5 region server/data nodes).
> Each node has lots of memory and disk (16GB of heap dedicated to
> RegionServers), 4 TB of disk per node for hdfs.
> I have a table with about 1 million rows in hbase - that's all. Currently it
> is split across 50 regions.
> I was monitoring this with the hbase web gui and I noticed that a lot of the
> heap was being used (14GB). I was running a MR job and I was getting an
> error on the console that launched the job:
> Error: GC overhead limit exceeded hbase
>
> First question: is this going to hose the whole system? I didn't see the
> error in any of the hbase logs, so I assume that it was purely a client issue.
>
> So, naively thinking that maybe the GC had moved everything to permgen and
> just wasn't cleaning up, I thought I would do a rolling restart of my region
> servers and see if that cleared everything up. The first server I killed
> happened to be the one that was hosting the .META. table. Subsequently the
> web gui failed. Looking at the errors, it seems that the web gui essentially
> caches the address for the meta table and blindly tries connecting on every
> request. I suppose I could restart the master, but this does not seem like
> desirable behavior. Shouldn't the cache be refreshed on error? And since
> there is no real code for the GUI, just a jsp page, doesn't this mean that
> this behavior could be seen in other applications that use HMaster?
>
> Corrections welcome
> Dave
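If the "GC overhead limit exceeded" really is on the client side of the MR job, one knob that sometimes helps is scanner caching: a high caching value pulls a lot of rows into the task JVM per RPC. Below is a minimal sketch of setting up a table-input MR job with a lower caching value; the table name "my_table", the MyMapper class, and the value of 100 are placeholders, not anything from this thread, and the right number depends on your row sizes.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class ScanJobSetup {

  // Placeholder mapper; the real per-row logic would go here.
  static class MyMapper extends TableMapper<ImmutableBytesWritable, Text> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      // Just emits a marker per row; replace with the actual map logic.
      context.write(row, new Text("seen"));
    }
  }

  public static Job createJob(Configuration conf) throws Exception {
    Job job = new Job(conf, "scan-my-table");
    job.setJarByClass(ScanJobSetup.class);

    Scan scan = new Scan();
    // Fetch fewer rows per RPC so the client/task heap holds less at once.
    scan.setCaching(100);
    // Avoid churning the region server block cache with a one-off full scan.
    scan.setCacheBlocks(false);

    TableMapReduceUtil.initTableMapperJob(
        "my_table", scan, MyMapper.class,
        ImmutableBytesWritable.class, Text.class, job);
    return job;
  }
}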