Hi Raghu.

The only lead I have is that my root mount had filled up completely.

This in itself should not have caused the metadata corruption, as the
metadata was stored on another mount point, which had plenty of space.

But perhaps the fact that the NameNode/SecondaryNameNode didn't have enough
space for their logs caused this?
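
In the meantime, as a precaution, I'm planning to keep a redundant copy of
the image on a separate mount. As far as I understand from the docs,
dfs.name.dir accepts a comma-separated list of directories and the NameNode
writes the image to all of them. A rough hadoop-site.xml sketch - the paths
here are just examples from my setup:

<property>
  <name>dfs.name.dir</name>
  <!-- two copies of the image, on separate mount points -->
  <value>/mnt/data1/hadoop/name,/mnt/data2/hadoop/name</value>
</property>
<property>
  <name>fs.checkpoint.dir</name>
  <!-- where the SecondaryNameNode stores its checkpoint -->
  <value>/mnt/data1/hadoop/namesecondary</value>
</property>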

Unfortunately, I was pressed for time to get the cluster up and running, and
didn't preserve the logs or the image.
If this happens again, I will be sure to do so.
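
Also, about recovering manually from the checkpoint - just so I have the
procedure written down for next time, is this roughly what you meant? This
is only my sketch from reading the user guide, so please correct me if I'm
wrong (hostnames and paths are illustrative):

# Stop HDFS, copy the secondary's checkpoint into an empty dfs.name.dir
# on the NameNode machine, then restart:
bin/stop-dfs.sh
scp -r secondary-host:/mnt/data1/hadoop/namesecondary/* /mnt/data1/hadoop/name/
bin/start-dfs.sh

Or, if I read the guide correctly, the NameNode can be started with the
-importCheckpoint option, which loads the latest checkpoint from
fs.checkpoint.dir, provided dfs.name.dir is empty:

bin/hadoop namenode -importCheckpoint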

Regards.

2009/5/5 Raghu Angadi <rang...@yahoo-inc.com>

>
> Stas,
>
> This is indeed a serious issue.
>
> Did you happen to store the corrupt image? Can this be reproduced using
> the image?
>
> Usually you can recover manually from a corrupt or truncated image. But
> more importantly, we want to find out how it got into this state.
>
> Raghu.
>
>
> Stas Oskin wrote:
>
>> Hi.
>>
>> This is quite a worrisome issue.
>>
>> Can anyone advise on this? I'm really concerned it could appear in
>> production and cause a huge data loss.
>>
>> Is there any way to recover from this?
>>
>> Regards.
>>
>> 2009/5/5 Tamir Kamara <tamirkam...@gmail.com>
>>
>>> I didn't have a space problem that led to it (I think). The corruption
>>> started after I bounced the cluster.
>>> At the time, I tried to investigate what led to the corruption but didn't
>>> find anything useful in the logs besides this line:
>>> saveLeases found path /tmp/temp623789763/tmp659456056/_temporary_attempt_200904211331_0010_r_000002_0/part-00002 but no matching entry in namespace
>>>
>>> I also tried to recover from the secondary namenode files, but the
>>> corruption was too widespread and I had to format.
>>>
>>> Tamir
>>>
>>> On Mon, May 4, 2009 at 4:48 PM, Stas Oskin <stas.os...@gmail.com> wrote:
>>>
>>>> Hi.
>>>>
>>>> Same conditions - where the space ran out and the fs got corrupted?
>>>>
>>>> Or did it get corrupted by itself (which is even more worrying)?
>>>>
>>>> Regards.
>>>>
>>>> 2009/5/4 Tamir Kamara <tamirkam...@gmail.com>
>>>>
>>>>> I had the same problem a couple of weeks ago with 0.19.1. Had to
>>>>> reformat the cluster too...
>>>>>
>>>>> On Mon, May 4, 2009 at 3:50 PM, Stas Oskin <stas.os...@gmail.com>
>>>>> wrote:
>>>
>>>>>> Hi.
>>>>>>
>>>>>> After rebooting the NameNode server, I found that the NameNode doesn't
>>>>>> start anymore.
>>>>>>
>>>>>> The logs contained this error:
>>>>>> "FSNamesystem initialization failed"
>>>>>>
>>>>>> I suspected filesystem corruption, so I tried to recover from the
>>>>>> SecondaryNameNode. Problem is, it was completely empty!
>>>>>>
>>>>>> I had an issue that might have caused this - the root mount had run
>>>>>> out of space. But both the NameNode and the SecondaryNameNode
>>>>>> directories were on another mount point with plenty of space there -
>>>>>> so it's very strange that they were impacted in any way.
>>>>>>
>>>>>> Perhaps the logs, which were located on the root mount and as a result
>>>>>> could not be written, have caused this?
>>>>>>
>>>>>> To get HDFS running again, I had to format it (including manually
>>>>>> erasing the files from the DataNodes). While this is reasonable in a
>>>>>> test environment, production-wise it would be very bad.
>>>>>>
>>>>>> Any idea why it happened, and what can be done to prevent it in the
>>>>>> future?
>>>>>>
>>>>>> I'm using the stable 0.18.3 version of Hadoop.
>>>>>>
>>>>>> Thanks in advance!
>>>>>>
>>
>
