[
https://issues.apache.org/jira/browse/HBASE-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12611290#action_12611290
]
stack commented on HBASE-706:
-----------------------------
In a previous application, we'd set aside a bit of memory to release when the
application hit an OOME. The reservation was done on startup. It was a linked
list of sizeable blocks rather than one big monolithic block, probably so
allocation would still work in a fragmented heap. In that app's case, the
default was a single block of 5M. Maybe in hbase, set aside more? 20M? In 4
blocks? Make it configurable? What do you think? Default hbase heap is 1G I
believe (see the bin/hbase script).
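Roughly, the reservoir could look something like the below -- just a sketch of
the allocate-on-startup, release-once idea; the class and names are made up,
not anything already in hbase:
{code}
import java.util.LinkedList;
import java.util.List;

// Hypothetical reservoir: allocate a handful of blocks at startup, free them
// all when we hit an OOME so the shutdown path has some headroom.
public class MemoryReservoir {
  private List<byte[]> blocks = new LinkedList<byte[]>();

  public MemoryReservoir(int blockCount, int blockSize) {
    for (int i = 0; i < blockCount; i++) {
      // Several smaller blocks rather than one big monolithic one, so the
      // allocation still succeeds in a fragmented heap.
      this.blocks.add(new byte[blockSize]);
    }
  }

  // Release may be called by more than one thread; after the first call it
  // is a noop.
  public synchronized void release() {
    if (this.blocks != null) {
      this.blocks.clear();
      this.blocks = null;
    }
  }
}
{code}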
The main loop in the application was wrapped in a try/catch. On 'serious
error', we'd first release the memory reservoir -- the release could be run by
more than one thread, so sometimes it'd be a noop -- and then run code to set
the application into a safe 'park' so it could be analyzed later by an
operator. In our case, things would be a little trickier because there is more
than just the one loop. The OOME could bubble out in the main
master/regionserver loops or in one of the service thread loops. You'd have to
plug the OOME processing into all those places (some inherit from Chore, so
you could add the processing there). We also want our regionserver to go all
the way down if it hits an OOME, to minimize the damage done.
I'd imagine that all you'd do is, on OOME, release the memory and then let the
shutdown proceed normally. Hopefully, the very release of the reservoir would
be sufficient for a successful shutdown.
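Something like the below in a run loop, say (again only a sketch; the
stopRequested flag and the doShutdown() hook are stand-ins for whatever the
regionserver actually has, not real HRegionServer members):
{code}
// Hypothetical use of the reservoir in a main run loop.
public class OomeAwareServer implements Runnable {
  // e.g. 20M in 4 blocks, as suggested above; would be configurable.
  private final MemoryReservoir reservoir = new MemoryReservoir(4, 5 * 1024 * 1024);
  private volatile boolean stopRequested = false;

  public void run() {
    try {
      while (!stopRequested) {
        // ... normal server work ...
      }
    } catch (OutOfMemoryError e) {
      // Free the reservoir first so the shutdown path has some headroom,
      // then fall through to the normal shutdown.
      reservoir.release();
      stopRequested = true;
    }
    doShutdown();  // normal shutdown/cleanup proceeds as usual
  }

  private void doShutdown() {
    // close regions, flush logs, report to the master, etc.
  }
}
{code}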
Tests will be hard. There is an OOMERegionServer. You might play with that.
You probably won't be able to run it inline as a unit test. That's OK, I think.
Also, I don't think it's possible to write a handler that will work in all
cases, just most: e.g. there may be a pathological case where the just-released
reservoir gets eaten up immediately by a rampant thread. I think we'll just
have to make a patch that does the above, commit it, and then watch how well
it does out in the field.
> On OOME, regionserver sticks around and doesn't go down with cluster
> --------------------------------------------------------------------
>
> Key: HBASE-706
> URL: https://issues.apache.org/jira/browse/HBASE-706
> Project: Hadoop HBase
> Issue Type: Bug
> Reporter: stack
> Fix For: 0.2.0
>
>
> On John Gray's cluster, an errant, massive store file caused us an OOME.
> Shutdown of the cluster left this regionserver in place. A thread dump failed
> with OOME. Here is the last thing in the log:
> {code}
> 2008-06-25 03:21:55,111 INFO org.apache.hadoop.hbase.HRegionServer: worker thread exiting
> 2008-06-25 03:24:26,923 FATAL org.apache.hadoop.hbase.HRegionServer: Set stop flag in regionserver/0:0:0:0:0:0:0:0:60020.cacheFlusher
> java.lang.OutOfMemoryError: Java heap space
>   at java.util.HashMap.<init>(HashMap.java:226)
>   at java.util.HashSet.<init>(HashSet.java:103)
>   at org.apache.hadoop.hbase.HRegionServer.getRegionsToCheck(HRegionServer.java:1789)
>   at org.apache.hadoop.hbase.HRegionServer$Flusher.enqueueOptionalFlushRegions(HRegionServer.java:479)
>   at org.apache.hadoop.hbase.HRegionServer$Flusher.run(HRegionServer.java:385)
> 2008-06-25 03:24:26,923 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 60020, call batchUpdate(items,,1214272763124, 9223372036854775807, [EMAIL PROTECTED]) from 192.168.249.230:38278: error: java.io.IOException: Server not running
> java.io.IOException: Server not running
>   at org.apache.hadoop.hbase.HRegionServer.checkOpen(HRegionServer.java:1758)
>   at org.apache.hadoop.hbase.HRegionServer.batchUpdate(HRegionServer.java:1547)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:616)
>   at org.apache.hadoop.hbase.ipc.HbaseRPC$Server.call(HbaseRPC.java:413)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:901)
> {code}
> If I get an OOME just trying to thread dump, that would seem to indicate we
> need to start keeping a little memory reservoir around for emergencies such
> as this, just so we can shut down cleanly.
> Moving this into 0.2. Seems important to fix if robustness is the name of
> the game.