Newer 1.6 releases are getting close to 1.7, so I am not going to fear a version number and fight the future.
I have been at around 27 million files for a while (it has been as high as 30 million), so I do not think that is related. I do not think it is related to checkpoints either, but I am considering raising/lowering the checkpoint triggers (see the settings sketch below the quoted messages).

On Saturday, December 22, 2012, Joep Rottinghuis <jrottingh...@gmail.com> wrote:
> Do your OOMs correlate with the secondary checkpointing?
>
> Joep
>
> Sent from my iPhone
>
> On Dec 22, 2012, at 7:42 AM, Michael Segel <michael_se...@hotmail.com> wrote:
>
>> Hey, silly question...
>>
>> How long have you had 27 million files?
>>
>> I mean, can you correlate the number of files to the spate of OOMs?
>>
>> Even without problems, I'd say it would be a good idea to upgrade, due to the probability of a lot of code fixes...
>>
>> If you're running anything pre 1.x, going to 1.7 Java wouldn't be a good idea. Having said that... outside of MapR, have any of the distros certified themselves on 1.7 yet?
>>
>> On Dec 22, 2012, at 6:54 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>
>>> I will give this a go. I have actually gone into JMX and manually triggered
>>> GC; no memory is returned. So I assumed something was leaking.
>>>
>>> On Fri, Dec 21, 2012 at 11:59 PM, Adam Faris <afa...@linkedin.com> wrote:
>>>
>>>> I know this will sound odd, but try reducing your heap size. We had an
>>>> issue like this where GC kept falling behind and we either ran out of heap
>>>> or would be in full GC. By reducing the heap, we forced concurrent mark
>>>> sweep to occur and avoided both full GC and running out of heap space, as
>>>> the JVM would collect objects more frequently.
>>>>
>>>> On Dec 21, 2012, at 8:24 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>>>
>>>>> I have an old Hadoop 0.20.2 cluster. It has not had any issues for a while
>>>>> (which is why I never bothered with an upgrade).
>>>>>
>>>>> Suddenly it OOMed last week. Now the OOMs happen periodically. We have a
>>>>> fairly large NameNode heap (Xmx 17 GB). It is a fairly large FS, about
>>>>> 27,000,000 files.
>>>>>
>>>>> The strangest thing is that every hour and a half the NN memory usage
>>>>> increases until the heap is full.
>>>>>
>>>>> http://imagebin.org/240287
>>>>>
>>>>> We tried failing over the NN to another machine. We changed the Java version
>>>>> from 1.6_23 -> 1.7.0.
>>>>>
>>>>> I have set the NameNode logs to DEBUG and ALL, and I have done the same with
>>>>> the data nodes.
>>>>> The secondary NN is running, shipping edits, and making new images.
>>>>>
>>>>> I am thinking something has corrupted the NN metadata and after enough time
>>>>> it becomes a time bomb, but this is just a total shot in the dark. Does
>>>>> anyone have any interesting troubleshooting ideas?
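
P.S. For anyone following along, the checkpoint triggers I mean are the fs.checkpoint.period and fs.checkpoint.size properties that the secondary NameNode reads from core-site.xml in 0.20.x. A minimal sketch with the stock defaults (3600 seconds and 64 MB); the values are just placeholders showing where the knobs live, not numbers I am committing to:

    <!-- core-site.xml on the secondary NameNode (Hadoop 0.20.x) -->
    <property>
      <name>fs.checkpoint.period</name>
      <value>3600</value>        <!-- seconds between checkpoints (default 3600) -->
    </property>
    <property>
      <name>fs.checkpoint.size</name>
      <value>67108864</value>    <!-- edit log size in bytes that forces a checkpoint (default 64 MB) -->
    </property>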
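And on Adam's suggestion of shrinking the heap so concurrent mark sweep stays ahead of allocation: on 0.20.x that would go through HADOOP_NAMENODE_OPTS in hadoop-env.sh, roughly like the sketch below. The heap size and occupancy threshold are illustrative placeholders, not values I have tested; the GC logging flags are there to show whether memory is truly never reclaimed or the collector is just falling behind.

    # hadoop-env.sh -- sketch only; size and threshold are placeholders
    export HADOOP_NAMENODE_OPTS="-Xmx14g \
      -XX:+UseConcMarkSweepGC \
      -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly \
      -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
      $HADOOP_NAMENODE_OPTS"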