On May 17, 2010, at 5:25 AM, Steve Loughran wrote:

> Brian Bockelman wrote:
>> On May 14, 2010, at 8:27 PM, Todd Lipcon wrote:
>>> Hey Brian,
>>> 
>>> Yep, excessive GC definitely sounds like a likely culprit. I'm surprised you
>>> didn't see OOMEs in the log, though.
>>> 
>> We didn't until the third restart today.  I have no clue why we haven't seen 
>> this in the past 9 months of this cluster though...
>> Anyhow, it looks like this might have done the trick... the sysadmin is 
>> heading over to kick over a few errant datanodes, and we should be able to 
>> get out of safemode soon.  Luckily, it's a 4-day weekend in Europe and 
>> otherwise a Friday evening in the US, so there's only a few folks using it.
> 
> good thing we europeans have long weekends.
> 

:) Indeed

>>> If you want to monitor GC, I'd recommend adding -verbose:gc
>>> -XX:+PrintGCDetails -XX:+PrintGCDateStamps to your java options -
>>> occasionally useful for times like this.
>>> 
> 
> What are your current GC options? Played with compressed object pointers yet?

I've been eyeballing them, but haven't had a chance to try them yet.  We 
probably won't mess with them until we start to run out of RAM on the machines 
themselves.

This particular instance was a simple oversight - there was no need to try and 
fit the NN into a 1GB heap on a dedicated machine.  I tell folks 1GB RAM per 1M 
objects.  It's almost always an over-estimate, but better safe than deadlocked 
on a Friday evening...
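
A minimal sketch of that rule of thumb, in case it helps anyone sizing a NN 
heap.  The function name, the 1GB floor, and the idea of turning the result 
into an -Xmx flag are assumptions made for illustration; the only part taken 
from the rule above is the 1GB per 1M objects ratio.

```python
import math


def estimate_nn_heap_gb(num_objects, gb_per_million=1.0):
    """Rough NameNode heap estimate: 1GB per 1M objects, rounded up.

    Hypothetical helper. The ratio is a deliberately generous rule of
    thumb, not a measured figure.
    """
    if num_objects < 0:
        raise ValueError("num_objects must be non-negative")
    needed = math.ceil(num_objects / 1_000_000 * gb_per_million)
    # Assumption: never go below 1GB, even for a nearly empty namespace.
    return max(1, needed)


def heap_flag(num_objects):
    """Format the estimate as a JVM maximum-heap option, e.g. -Xmx3g."""
    return "-Xmx%dg" % estimate_nn_heap_gb(num_objects)


if __name__ == "__main__":
    # Example: a namespace with 2.5M objects.
    print(heap_flag(2_500_000))
```

The resulting flag would go wherever you already set the NN's java options.  
Treat the number as a starting point and check it against the GC logs.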

Brian
