Brian Bockelman wrote:
On May 17, 2010, at 5:25 AM, Steve Loughran wrote:

Brian Bockelman wrote:
On May 14, 2010, at 8:27 PM, Todd Lipcon wrote:
Hey Brian,

Yep, excessive GC definitely sounds like a likely culprit. I'm surprised you
didn't see OOMEs in the log, though.

We didn't until the third restart today.  I have no clue why we haven't seen 
this in the past 9 months of this cluster though...
Anyhow, it looks like this might have done the trick... the sysadmin is heading 
over to kick over a few errant datanodes, and we should be able to get out of 
safemode soon.  Luckily, it's a 4-day weekend in Europe and otherwise a Friday 
evening in the US, so there's only a few folks using it.
Good thing we Europeans have long weekends.


:) Indeed

If you want to monitor GC, I'd recommend adding -verbose:gc
-XX:+PrintGCDetails -XX:+PrintGCDateStamps to your java options -
occasionally useful for times like this.
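
As a complement to those flags, the same counters can be read from inside a running JVM through the standard java.lang.management API. This is only a rough sketch: the class name GcSnapshot and the helper totalGcMillis are made up for illustration, and nothing here is specific to Hadoop or the NameNode.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: prints per-collector GC counts and accumulated GC time.
public class GcSnapshot {

    // Sums the accumulated collection time (ms) over all collectors.
    // getCollectionTime() returns -1 when a collector does not report it,
    // so non-positive values are skipped.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        System.out.println("total GC ms: " + totalGcMillis());
    }
}
```

Sampling this periodically and watching the total climb faster than wall-clock time would be one way to spot the kind of GC thrashing discussed here.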

What are your current GC options? Played with compressed object pointers yet?

I've been eyeballing them, but haven't had any chance yet.  We probably won't 
mess with them until we start to run out of RAM on the machines themselves.

This particular instance was a simple oversight - there was no need to try and 
fit the NN into a 1GB heap on a dedicated machine.  I tell folks 1GB RAM per 1M 
objects.  It's almost always an over-estimate, but better safe than deadlocked 
on a Friday evening...
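
For what it's worth, the 1GB-per-1M-objects rule of thumb works out as below. This is just the arithmetic from the paragraph above, rounded up to whole gigabytes; the class and method names are invented for the example, and the object count is a placeholder rather than a figure from this cluster.

```java
// Hypothetical calculator for the "1GB RAM per 1M objects" rule of thumb.
public class NameNodeHeapEstimate {

    // Rule of thumb from the thread: one million objects per GB of heap.
    static final long OBJECTS_PER_GB = 1_000_000L;

    // Returns the suggested heap in whole GB, rounding up.
    static long estimateHeapGb(long objects) {
        if (objects < 0) {
            throw new IllegalArgumentException("object count must be non-negative");
        }
        return (objects + OBJECTS_PER_GB - 1) / OBJECTS_PER_GB;
    }

    public static void main(String[] args) {
        long objects = 12_500_000L; // placeholder count, not a measured value
        System.out.println(estimateHeapGb(objects) + " GB"); // prints "13 GB"
    }
}
```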


I've been using compressed pointers on JRockit for a long time. It's a very nice JVM that never seems to run out of stack when you accidentally tail recurse without end. The Sun JVM's compressed pointers are newer, but I've not had any problems with that part of the JVM, and the benefits in both memory consumption and possibly cache hits make them very appealing.
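
On the Sun/HotSpot side, the switch is -XX:+UseCompressedOops. One way to check whether a running JVM actually has it on is to ask the HotSpot diagnostic bean, roughly as sketched here. This assumes a HotSpot-based JVM on Java 7 or later that ships com.sun.management; the class name CompressedOopsCheck and the "unavailable" fallback are invented for the example.

```java
import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

// Hypothetical check: reports the current UseCompressedOops setting.
public class CompressedOopsCheck {

    // Returns "true" or "false" as reported by the JVM, or "unavailable"
    // when the option or the diagnostic bean does not exist (for example
    // on a 32-bit JVM).
    static String compressedOopsSetting() {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (bean == null) {
                return "unavailable";
            }
            return bean.getVMOption("UseCompressedOops").getValue();
        } catch (IllegalArgumentException e) {
            // Thrown when the option is unknown to this JVM.
            return "unavailable";
        }
    }

    public static void main(String[] args) {
        System.out.println("UseCompressedOops = " + compressedOopsSetting());
    }
}
```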
