Just in case someone's curious.
Stop and restart dfs with 0.13.1:
- master name node says:
2007-08-24 18:31:27,318 INFO org.apache.hadoop.dfs.NameNode: Namenode up at: hadoop001.sf2p.facebook.com/10.16.159.101:9000
2007-08-24 18:31:28,560 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedDelete: failed to remove /tmp/pu3 because it does not exist
2007-08-24 18:31:28,571 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /user/facebook/chatter/rawcounts/2007-08-04/_task_0001_r_000044_0/part-00044 to /user/facebook/chatter/rawcounts/2007-08-04/part-00044 because destination exists
2007-08-24 18:31:28,571 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /user/facebook/chatter/rawcounts/2007-08-04/_task_0001_r_000044_0/.part-00044.crc to /user/facebook/chatter/rawcounts/2007-08-04/.part-00044.crc because destination exists
2007-08-24 18:31:28,572 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /user/facebook/chatter/rawcounts/2007-08-04/_task_0001_r_000040_0/part-00040 to /user/facebook/chatter/rawcounts/2007-08-04/part-00040 because destination exists
2007-08-24 18:31:28,572 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /user/facebook/chatter/rawcounts/2007-08-04/_task_0001_r_000040_0/.part-00040.crc to /user/facebook/chatter/rawcounts/2007-08-04/.part-00040.crc because destination exists
2007-08-24 18:31:28,573 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /user/facebook/chatter/rawcounts/2007-08-04/_task_0001_r_000052_0/part-00052 to /user/facebook/chatter/rawcounts/2007-08-04/part-00052 because destination exists
...
There's a serious blast of these (replaying the edit log?). In any case,
after this is done it enters safemode - I presume the fs is corrupted by
then. At the exact same time, the datanodes are busy deleting blocks:
2007-08-24 18:31:33,243 INFO org.apache.hadoop.dfs.DataNode: Starting DataNode in: FSDataset{dirpath='/var/hadoop/tmp/dfs/data/current'}
2007-08-24 18:31:33,243 INFO org.apache.hadoop.dfs.DataNode: using BLOCKREPORT_INTERVAL of 3588023msec
2007-08-24 18:31:34,252 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-9223045762536565560 file /var/hadoop/tmp/dfs/data/current/subdir14/subdir18/blk_-9223045762536565560
2007-08-24 18:31:34,269 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-9214178286744587840 file /var/hadoop/tmp/dfs/data/current/subdir14/subdir12/blk_-9214178286744587840
2007-08-24 18:31:34,370 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-9213127144044535407 file /var/hadoop/tmp/dfs/data/current/subdir14/subdir20/blk_-9213127144044535407
2007-08-24 18:31:34,386 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-9211625398030978419 file /var/hadoop/tmp/dfs/data/current/subdir14/subdir26/blk_-9211625398030978419
2007-08-24 18:31:34,418 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-9189558923884323865 file /var/hadoop/tmp/dfs/data/current/subdir14/subdir24/blk_-9189558923884323865
2007-08-24 18:31:34,419 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-9115468136273900585 file /var/hadoop/tmp/dfs/data/current/subdir10/blk_-9115468136273900585
Ouch - I guess those are all the blocks that fsck is now reporting as
missing. Known bug? Operator error? (Well - I did do a clean
shutdown...)
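(For reference, the check I mean is just the stock hadoop fsck from the
CLI - nothing fancy. Something along these lines; the second path is
only an example from our layout:

  bin/hadoop fsck / -files -blocks -locations
  bin/hadoop fsck /user/facebook/chatter/rawcounts/2007-08-04

The files it flags with missing blocks line up with the blocks being
deleted in the datanode logs above.)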
-----Original Message-----
From: Joydeep Sen Sarma [mailto:[EMAIL PROTECTED]
Sent: Friday, August 24, 2007 7:21 PM
To: [email protected]
Subject: RE: secondary namenode errors
I wish I had read the bug more carefully - I thought the issue was
fixed in 0.13.1.
Of course not; the issue persists. Meanwhile, half the files are
corrupted after the upgrade (I followed the upgrade wiki and tried to
restore the backed-up metadata and the old version, roughly the steps
below - to no avail).
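(For the record, the restore attempt was basically the manual procedure
from the upgrade wiki - roughly the following. The dfs.name.dir and
backup paths here are made up to illustrate our setup, not the exact
commands:

  # stop the cluster before touching any metadata
  bin/stop-dfs.sh
  # put the pre-upgrade namenode metadata back in place of the new one
  mv /var/hadoop/name /var/hadoop/name.broken
  cp -r /backup/name.pre-upgrade /var/hadoop/name
  # bring the old release back up from its own install directory
  cd /opt/hadoop-0.13.0 && bin/start-dfs.sh

No luck either way.)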
Sigh - have a nice weekend everyone,
Joydeep
-----Original Message-----
From: Koji Noguchi [mailto:[EMAIL PROTECTED]
Sent: Friday, August 24, 2007 8:29 AM
To: [email protected]
Subject: Re: secondary namenode errors
Joydeep,
I think you're hitting this bug.
http://issues.apache.org/jira/browse/HADOOP-1076
In any case, as Raghu suggested, please use 0.13.1 and not 0.13.
Koji
Raghu Angadi wrote:
> Joydeep Sen Sarma wrote:
>> Thanks for replying.
>>
>> Can you please clarify - is it the case that the secondary namenode
>> stuff only works in 0.13.1? and what's the connection with replication
>> factor?
>>
>> We lost the file system completely once, trying to make sure we can
>> avoid it the next time.
>
> I am not sure if the problem you reported still exists in 0.13.1. You
> might still have the problem and you can ask again. But you should
> move to 0.13.1 since it has some critical fixes. See the release notes
> for 0.13.1 or HADOOP-1603. You should always upgrade to the latest minor
> release when moving to the next major version.
>
> Raghu.
>
>> Joydeep
>>
>> -----Original Message-----
>> From: Raghu Angadi [mailto:[EMAIL PROTECTED]
>> Sent: Thursday, August 23, 2007 9:44 PM
>> To: [email protected]
>> Subject: Re: secondary namenode errors
>>
>>
>> On a related note, please don't use 0.13.0; use the latest released
>> version for 0.13 (I think it is 0.13.1). If the secondary namenode
>> actually works, it will result in all the replications being set to 1.
>>
>> Raghu.
>>
>> Joydeep Sen Sarma wrote:
>>> Hi folks,