Hi Raghu,

The thread you posted is my original post, written when this problem first happened on my cluster. I can file a JIRA, but I wouldn't be able to provide any information beyond what I already posted, and I don't have the logs from that time. Should I still file?
Thanks,
Tamir

On Tue, May 5, 2009 at 9:14 PM, Raghu Angadi <rang...@yahoo-inc.com> wrote:
> Tamir,
>
> Please file a jira on the problem you are seeing with 'saveLeases'. In the
> past there have been multiple fixes in this area (HADOOP-3418, HADOOP-3724,
> and more mentioned in HADOOP-3724).
>
> Also refer to the thread you started:
> http://www.mail-archive.com/core-user@hadoop.apache.org/msg09397.html
>
> I think another user reported the same problem recently.
>
> These are indeed very serious and very annoying bugs.
>
> Raghu.
>
> Tamir Kamara wrote:
>
>> I didn't have a space problem which led to it (I think). The corruption
>> started after I bounced the cluster.
>> At the time, I tried to investigate what led to the corruption, but I
>> didn't find anything useful in the logs besides this line:
>>
>> saveLeases found path
>> /tmp/temp623789763/tmp659456056/_temporary_attempt_200904211331_0010_r_000002_0/part-00002
>> but no matching entry in namespace
>>
>> I also tried to recover from the secondary name node files, but the
>> corruption was too wide-spread and I had to format.
>>
>> Tamir
>>
>> On Mon, May 4, 2009 at 4:48 PM, Stas Oskin <stas.os...@gmail.com> wrote:
>>
>>> Hi.
>>>
>>> Same conditions - where the space ran out and the fs got corrupted?
>>>
>>> Or did it get corrupted by itself (which is even more worrying)?
>>>
>>> Regards.
>>>
>>> 2009/5/4 Tamir Kamara <tamirkam...@gmail.com>
>>>
>>>> I had the same problem a couple of weeks ago with 0.19.1. Had to
>>>> reformat the cluster too...
>>>>
>>>> On Mon, May 4, 2009 at 3:50 PM, Stas Oskin <stas.os...@gmail.com> wrote:
>>>>
>>>>> Hi.
>>>>>
>>>>> After rebooting the NameNode server, I found out that the NameNode
>>>>> doesn't start anymore.
>>>>>
>>>>> The logs contained this error:
>>>>> "FSNamesystem initialization failed"
>>>>>
>>>>> I suspected filesystem corruption, so I tried to recover from the
>>>>> SecondaryNameNode. The problem is, it was completely empty!
>>>>>
>>>>> I had an issue that might have caused this - the root mount had run
>>>>> out of space. But both the NameNode and the SecondaryNameNode
>>>>> directories were on another mount point with plenty of space there,
>>>>> so it's very strange that they were impacted in any way.
>>>>>
>>>>> Perhaps the logs, which were located on the root mount and as a
>>>>> result could not be written, caused this?
>>>>>
>>>>> To get HDFS running again, I had to format it (including manually
>>>>> erasing the files from the DataNodes). While this is reasonable in a
>>>>> test environment, production-wise it would be very bad.
>>>>>
>>>>> Any idea why it happened, and what can be done to prevent it in the
>>>>> future? I'm using the stable 0.18.3 version of Hadoop.
>>>>>
>>>>> Thanks in advance!
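[Editor's note: a common mitigation for the single-copy metadata loss described in this thread is to give the NameNode (and the checkpoint) more than one storage directory, so the fsimage and edits are written to several disks. This is a minimal hadoop-site.xml sketch for the 0.18/0.19 era discussed here, using the `dfs.name.dir` and `fs.checkpoint.dir` properties; the mount paths shown are hypothetical examples, not values from the thread.]

```
<configuration>
  <!-- Comma-separated list: the NameNode writes its fsimage and edit
       log to ALL of these directories, so losing one disk (or one
       full mount) does not lose the namespace. Paths are examples. -->
  <property>
    <name>dfs.name.dir</name>
    <value>/disk1/hadoop/name,/disk2/hadoop/name</value>
  </property>

  <!-- Same idea for the SecondaryNameNode's checkpoint copies. -->
  <property>
    <name>fs.checkpoint.dir</name>
    <value>/disk1/hadoop/namesecondary,/disk2/hadoop/namesecondary</value>
  </property>
</configuration>
```

Keeping at least one of these directories on a separate physical disk (or an NFS mount) from the log partition also avoids the failure mode suspected above, where a full root mount takes the metadata down with it.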