Hi Raghu. The only lead I have is that my root mount has filled up completely.
This in itself should not have caused the metadata corruption, as the metadata is stored on another mount point, which had plenty of space. But perhaps the fact that the NameNode/SecondaryNameNode didn't have enough space for logs caused this? Unfortunately I was pressed for time to get the cluster up and running, and didn't preserve the logs or the image. If this happens again, I will surely do so.

Regards.

2009/5/5 Raghu Angadi <rang...@yahoo-inc.com>

> Stas,
>
> This is indeed a serious issue.
>
> Did you happen to store the corrupt image? Can this be reproduced using
> the image?
>
> Usually you can recover manually from a corrupt or truncated image. But
> more importantly we want to find out how it got into this state.
>
> Raghu.
>
> Stas Oskin wrote:
>
>> Hi.
>>
>> This is quite a worrisome issue.
>>
>> Can anyone advise on this? I'm really concerned it could appear in
>> production and cause a huge data loss.
>>
>> Is there any way to recover from this?
>>
>> Regards.
>>
>> 2009/5/5 Tamir Kamara <tamirkam...@gmail.com>
>>
>>> I didn't have a space problem which led to it (I think). The corruption
>>> started after I bounced the cluster.
>>> At the time, I tried to investigate what led to the corruption, but I
>>> didn't find anything useful in the logs besides this line:
>>>
>>> saveLeases found path
>>> /tmp/temp623789763/tmp659456056/_temporary_attempt_200904211331_0010_r_000002_0/part-00002
>>> but no matching entry in namespace
>>>
>>> I also tried to recover from the secondary name node files, but the
>>> corruption was too widespread and I had to format.
>>>
>>> Tamir
>>>
>>> On Mon, May 4, 2009 at 4:48 PM, Stas Oskin <stas.os...@gmail.com> wrote:
>>>
>>>> Hi.
>>>>
>>>> Same conditions - where the space ran out and the fs got corrupted?
>>>>
>>>> Or did it get corrupted by itself (which is even more worrying)?
>>>>
>>>> Regards.
>>>>
>>>> 2009/5/4 Tamir Kamara <tamirkam...@gmail.com>
>>>>
>>>>> I had the same problem a couple of weeks ago with 0.19.1. Had to
>>>>> reformat the cluster too...
>>>>>
>>>>> On Mon, May 4, 2009 at 3:50 PM, Stas Oskin <stas.os...@gmail.com> wrote:
>>>>>
>>>>>> Hi.
>>>>>>
>>>>>> After rebooting the NameNode server, I found out that the NameNode
>>>>>> doesn't start anymore.
>>>>>>
>>>>>> The logs contained this error:
>>>>>> "FSNamesystem initialization failed"
>>>>>>
>>>>>> I suspected filesystem corruption, so I tried to recover from the
>>>>>> SecondaryNameNode. The problem is, it was completely empty!
>>>>>>
>>>>>> I had an issue that might have caused this - the root mount ran out
>>>>>> of space. But both the NameNode and the SecondaryNameNode directories
>>>>>> were on another mount point with plenty of space there, so it's very
>>>>>> strange that they were impacted in any way.
>>>>>>
>>>>>> Perhaps the logs, which were located on the root mount and as a
>>>>>> result could not be written, caused this?
>>>>>>
>>>>>> To get HDFS running again, I had to format it (including manually
>>>>>> erasing the files from the DataNodes). While this is reasonable in a
>>>>>> test environment, production-wise it would be very bad.
>>>>>>
>>>>>> Any idea why it happened, and what can be done to prevent it in the
>>>>>> future?
>>>>>>
>>>>>> I'm using the stable 0.18.3 version of Hadoop.
>>>>>>
>>>>>> Thanks in advance!
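[Editor's note: since the trigger in both reports was a mount filling up unnoticed, a minimal preventive sketch may be useful. Nothing below comes from the thread itself - the paths, threshold, and script name are placeholder assumptions; substitute the actual dfs.name.dir and log directory from your hadoop-site.xml and hadoop-env.sh. It only uses POSIX df/awk and could be run from cron on the NameNode host.]

```shell
#!/bin/sh
# Hedged sketch: warn when the filesystem holding the NameNode image
# or its logs is running low on free space.
# NAME_DIR and LOG_DIR are placeholders, not values from this thread.
NAME_DIR="${NAME_DIR:-/tmp}"     # e.g. your dfs.name.dir mount
LOG_DIR="${LOG_DIR:-/tmp}"       # e.g. your hadoop.log.dir mount
THRESHOLD_KB=1048576             # warn below ~1 GB free (adjust to taste)

check_free() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "WARNING: $dir does not exist"
        return 1
    fi
    # df -P prints POSIX-format output; column 4 is free space in KB
    free_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    if [ -z "$free_kb" ]; then
        echo "WARNING: could not determine free space for $dir"
        return 1
    fi
    if [ "$free_kb" -lt "$THRESHOLD_KB" ]; then
        echo "WARNING: $dir has only ${free_kb} KB free"
    else
        echo "OK: $dir has ${free_kb} KB free"
    fi
}

check_free "$NAME_DIR"
check_free "$LOG_DIR"
```

Separately, dfs.name.dir accepts a comma-separated list of directories, so the NameNode can write its image redundantly to independent mount points - which would have left a second intact copy in the scenarios described above.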