Please take a jmap -histo:live dump when the memory is full. Note that this causes a full GC. http://docs.oracle.com/javase/6/docs/technotes/tools/share/jmap.html
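A minimal sketch of taking that dump from the shell (the PID lookup is illustrative; any way of finding the NameNode PID works):

    # Find the NameNode PID, then take a live-object histogram.
    # The :live option is what triggers the full GC mentioned above.
    NN_PID=$(jps | awk '$2 == "NameNode" {print $1}')
    jmap -histo:live "$NN_PID" > /tmp/nn-histo.$(date +%s).txt

    # A full binary heap dump for offline analysis (also forces a full GC):
    jmap -dump:live,format=b,file=/tmp/nn-heap.hprof "$NN_PID"

Run it as the same user the NameNode runs as, or jmap will not attach.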
How many blocks do you have on the system? Send the JVM options you are using. Earlier Java versions used 1/8 of the total heap for the young gen; it has since gone up to 1/3 of the total heap. This could also be the reason. Do you collect GC logs? Send those as well (example flags at the bottom of this message).

Sent from a mobile device

On Dec 22, 2012, at 9:51 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> Newer 1.6 releases are getting close to 1.7, so I am not going to fear a
> number and fight the future.
>
> I have been at around 27 million files for a while, and have been as high
> as 30 million; I do not think that is related.
>
> I do not think it is related to checkpoints, but I am considering
> raising/lowering the checkpoint triggers.
>
> On Saturday, December 22, 2012, Joep Rottinghuis <jrottingh...@gmail.com> wrote:
>> Do your OOMs correlate with the secondary checkpointing?
>>
>> Joep
>>
>> Sent from my iPhone
>>
>> On Dec 22, 2012, at 7:42 AM, Michael Segel <michael_se...@hotmail.com> wrote:
>>
>>> Hey, silly question...
>>>
>>> How long have you had 27 million files?
>>>
>>> I mean, can you correlate the number of files to the spate of OOMs?
>>>
>>> Even without problems... I'd say it would be a good idea to upgrade due
>>> to the probability of a lot of code fixes...
>>>
>>> If you're running anything pre 1.x, going to Java 1.7 wouldn't be a good
>>> idea. Having said that... outside of MapR, have any of the distros
>>> certified themselves on 1.7 yet?
>>>
>>> On Dec 22, 2012, at 6:54 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>>
>>>> I will give this a go. I have actually gone into JMX and manually
>>>> triggered GC; no memory is returned. So I assumed something was leaking.
>>>>
>>>> On Fri, Dec 21, 2012 at 11:59 PM, Adam Faris <afa...@linkedin.com> wrote:
>>>>
>>>>> I know this will sound odd, but try reducing your heap size. We had an
>>>>> issue like this where GC kept falling behind and we either ran out of
>>>>> heap or would be in full GC. By reducing the heap, we forced concurrent
>>>>> mark sweep to occur and avoided both full GC and running out of heap
>>>>> space, as the JVM would collect objects more frequently.
>>>>>
>>>>> On Dec 21, 2012, at 8:24 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>>>>
>>>>>> I have an old Hadoop 0.20.2 cluster. We have not had any issues for a
>>>>>> while (which is why I never bothered with an upgrade).
>>>>>>
>>>>>> Suddenly it OOMed last week. Now the OOMs happen periodically. We have
>>>>>> a fairly large NameNode heap, -Xmx 17GB. It is a fairly large FS,
>>>>>> about 27,000,000 files.
>>>>>>
>>>>>> So the strangest thing is that every hour and a half the NN memory
>>>>>> usage increases until the heap is full.
>>>>>>
>>>>>> http://imagebin.org/240287
>>>>>>
>>>>>> We tried failing over the NN to another machine. We changed the Java
>>>>>> version from 1.6.0_23 to 1.7.0.
>>>>>>
>>>>>> I have set the NameNode logs to DEBUG and ALL, and I have done the
>>>>>> same with the data nodes.
>>>>>> The secondary NN is running, shipping edits, and making new images.
>>>>>>
>>>>>> I am thinking something has corrupted the NN metadata and after enough
>>>>>> time it becomes a time bomb, but this is just a total shot in the
>>>>>> dark. Does anyone have any interesting troubleshooting ideas?
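On the GC-log and JVM-option questions above, a minimal sketch of what to turn on, assuming a stock hadoop-env.sh; the log path and occupancy fraction are placeholders, not known-good values for this cluster:

    # hadoop-env.sh: GC logging, plus CMS tuned to start collections earlier,
    # which is another way to get the effect Adam got by shrinking the heap.
    export HADOOP_NAMENODE_OPTS="$HADOOP_NAMENODE_OPTS \
      -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
      -Xloggc:/var/log/hadoop/nn-gc.log \
      -XX:+UseConcMarkSweepGC \
      -XX:CMSInitiatingOccupancyFraction=65 -XX:+UseCMSInitiatingOccupancyOnly"

With -XX:+UseCMSInitiatingOccupancyOnly the collector kicks off a concurrent cycle whenever the old gen passes the given occupancy, rather than waiting until it is nearly full.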
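And on the checkpoint-trigger question upthread: in 0.20.x the secondary NN checkpoint is driven by two core-site.xml properties. The values shown are the stock defaults, listed only as a reference point for raising/lowering them:

    # core-site.xml properties controlling secondary-NN checkpoints (0.20.x):
    #   fs.checkpoint.period   3600       seconds between checkpoints
    #   fs.checkpoint.size     67108864   edits size in bytes that also triggers one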