So it turns out the issue was just the size of the filesystem.

2012-12-27 16:37:22,390 WARN org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Checkpoint done. New Image Size: 4,354,340,042
Basically, if the NN image size hits ~5,000,000,000 bytes you get f'ed. You need RAM of about 3x your FSImage size. If you do not have enough, you die a slow death.

On Sun, Dec 23, 2012 at 9:40 PM, Suresh Srinivas <sur...@hortonworks.com> wrote:

> Do not have access to my computer. Based on reading the previous email, I
> do not see anything suspicious on the list of objects in the histo live
> dump.
>
> I would like to hear from you about if it continued to grow. One instance
> of this I had seen in the past was related to weak reference related to
> socket objects. I do not see that happening here though.
>
> Sent from phone
>
> On Dec 23, 2012, at 10:34 AM, Edward Capriolo <edlinuxg...@gmail.com>
> wrote:
>
> > Tried this..
> >
> > NameNode is still Ruining my Xmas on its slow death march to OOM.
> >
> > http://imagebin.org/240453
> >
> > On Sat, Dec 22, 2012 at 10:23 PM, Suresh Srinivas <sur...@hortonworks.com> wrote:
> >
> >> -XX:NewSize=1G -XX:MaxNewSize=1G
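
For anyone wanting to sanity-check their own cluster against the "about 3x" rule of thumb at the top of this message, here is a rough sketch of the arithmetic. The function names and the default factor of 3 are illustrative assumptions taken from this thread, not anything Hadoop itself provides, and the result is only a ballpark figure.

```python
# Rough heap estimate from the "RAM ~ 3x FSImage size" rule of thumb.
# The factor is an assumption from this thread, not a Hadoop-documented value.

def estimate_heap_bytes(fsimage_bytes: int, factor: float = 3.0) -> int:
    """Return the approximate NameNode heap (bytes) suggested by the rule of thumb."""
    return int(fsimage_bytes * factor)


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB (2**30 bytes)."""
    return num_bytes / 2**30


if __name__ == "__main__":
    # Image size reported in the SecondaryNameNode checkpoint log line.
    image_size = 4_354_340_042
    heap = estimate_heap_bytes(image_size)
    print(f"FSImage: {to_gib(image_size):.2f} GiB, "
          f"suggested heap: {to_gib(heap):.2f} GiB")
```

Plugging in the "New Image Size" from your own checkpoint log gives a quick read on whether the NameNode heap is in the danger zone.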