Brian Bockelman wrote:
On May 14, 2010, at 8:27 PM, Todd Lipcon wrote:
Hey Brian,
Yep, excessive GC definitely sounds like a likely culprit. I'm surprised you
didn't see OOMEs in the log, though.
We didn't until the third restart today. I have no clue why we haven't seen
this in the past 9 months of this cluster though...
Anyhow, it looks like this might have done the trick... the sysadmin is heading
over to kick over a few errant datanodes, and we should be able to get out of
safemode soon. Luckily, it's a 4-day weekend in Europe and otherwise a Friday
evening in the US, so there are only a few folks using it.
Good thing we Europeans have long weekends.
If you want to monitor GC, I'd recommend adding -verbose:gc
-XX:+PrintGCDetails -XX:+PrintGCDateStamps to your Java options. They're
occasionally useful at times like this.
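
For reference, a minimal sketch of how those flags might be wired into the
NameNode's JVM options via hadoop-env.sh. The log path is hypothetical, and
-Xloggc is an extra flag not mentioned above that sends the GC output to a
file instead of stdout. This assumes a Java 6/7/8-era JVM; newer JVMs
replaced these flags with unified logging (-Xlog:gc*).

```shell
# Fragment for conf/hadoop-env.sh (assumed location; adjust for your install).
# GC_LOG_FILE is a hypothetical path chosen for illustration only.
GC_LOG_FILE="/var/log/hadoop/namenode-gc.log"

# Prepend the GC logging flags, keeping whatever options were already set.
export HADOOP_NAMENODE_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:${GC_LOG_FILE} ${HADOOP_NAMENODE_OPTS:-}"
```

The NameNode has to be restarted for the new options to take effect.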
What are your current GC options? Have you played with compressed object
pointers (-XX:+UseCompressedOops) yet?