Just in case someone's curious.

 

Stopped and restarted DFS with 0.13.1 (just the stock scripts - rough sketch below):
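(Roughly - this is a sketch of the usual scripts from the Hadoop install dir, not a verbatim transcript of what I typed:)

  bin/stop-dfs.sh     # stop the old DFS daemons
  # ... point the install / conf at the 0.13.1 build ...
  bin/start-dfs.sh    # bring DFS back up on 0.13.1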

 

- the master namenode says:

 

2007-08-24 18:31:27,318 INFO org.apache.hadoop.dfs.NameNode: Namenode up at: hadoop001.sf2p.facebook.com/10.16.159.101:9000

2007-08-24 18:31:28,560 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedDelete: failed to remove /tmp/pu3 because it does not exist

2007-08-24 18:31:28,571 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /user/facebook/chatter/rawcounts/2007-08-04/_task_0001_r_000044_0/part-00044 to /user/facebook/chatter/rawcounts/2007-08-04/part-00044 because destination exists

2007-08-24 18:31:28,571 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /user/facebook/chatter/rawcounts/2007-08-04/_task_0001_r_000044_0/.part-00044.crc to /user/facebook/chatter/rawcounts/2007-08-04/.part-00044.crc because destination exists

2007-08-24 18:31:28,572 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /user/facebook/chatter/rawcounts/2007-08-04/_task_0001_r_000040_0/part-00040 to /user/facebook/chatter/rawcounts/2007-08-04/part-00040 because destination exists

2007-08-24 18:31:28,572 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /user/facebook/chatter/rawcounts/2007-08-04/_task_0001_r_000040_0/.part-00040.crc to /user/facebook/chatter/rawcounts/2007-08-04/.part-00040.crc because destination exists

2007-08-24 18:31:28,573 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /user/facebook/chatter/rawcounts/2007-08-04/_task_0001_r_000052_0/part-00052 to /user/facebook/chatter/rawcounts/2007-08-04/part-00052 because destination exists

...

 

There's a serious blast of these (replaying the edit log?). In any case, after this is done it enters safe mode - I presume the filesystem is corrupted by then. At the exact same time, the datanodes are busy deleting blocks:

 

2007-08-24 18:31:33,243 INFO org.apache.hadoop.dfs.DataNode: Starting DataNode in: FSDataset{dirpath='/var/hadoop/tmp/dfs/data/current'}

2007-08-24 18:31:33,243 INFO org.apache.hadoop.dfs.DataNode: using BLOCKREPORT_INTERVAL of 3588023msec

2007-08-24 18:31:34,252 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-9223045762536565560 file /var/hadoop/tmp/dfs/data/current/subdir14/subdir18/blk_-9223045762536565560

2007-08-24 18:31:34,269 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-9214178286744587840 file /var/hadoop/tmp/dfs/data/current/subdir14/subdir12/blk_-9214178286744587840

2007-08-24 18:31:34,370 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-9213127144044535407 file /var/hadoop/tmp/dfs/data/current/subdir14/subdir20/blk_-9213127144044535407

2007-08-24 18:31:34,386 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-9211625398030978419 file /var/hadoop/tmp/dfs/data/current/subdir14/subdir26/blk_-9211625398030978419

2007-08-24 18:31:34,418 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-9189558923884323865 file /var/hadoop/tmp/dfs/data/current/subdir14/subdir24/blk_-9189558923884323865

2007-08-24 18:31:34,419 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-9115468136273900585 file /var/hadoop/tmp/dfs/data/current/subdir10/blk_-9115468136273900585

 

 

Ouch - I guess those are all the blocks that fsck is now reporting as missing. Known bug? Operator error? (Well - I did do a clean shutdown ...)
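(For reference, the checks I mean are roughly these - stock CLI from the Hadoop install dir; a sketch, and the exact flags may vary a bit by version:)

  bin/hadoop dfsadmin -safemode get                    # is the namenode still in safe mode?
  bin/hadoop dfsadmin -report                          # datanode / capacity summary
  bin/hadoop fsck / -files -blocks | grep -i MISSING   # files with missing blocks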

 

 

-----Original Message-----
From: Joydeep Sen Sarma [mailto:[EMAIL PROTECTED] 
Sent: Friday, August 24, 2007 7:21 PM
To: [email protected]
Subject: RE: secondary namenode errors

 

I wish I had read the bug more carefully - I thought the issue was fixed in 0.13.1.

Of course not, the issue persists. Meanwhile - half the files are corrupted after the upgrade (followed the upgrade wiki, tried restoring the backed-up metadata and the old version - to no avail).

 

Sigh - have a nice weekend everyone,

 

Joydeep

 

-----Original Message-----
From: Koji Noguchi [mailto:[EMAIL PROTECTED]
Sent: Friday, August 24, 2007 8:29 AM
To: [email protected]
Subject: Re: secondary namenode errors

 

Joydeep,

 

I think you're hitting this bug.

http://issues.apache.org/jira/browse/HADOOP-1076

 

In any case, as Raghu suggested, please use 0.13.1 and not 0.13.

 

Koji

 

 

 

 

Raghu Angadi wrote:
> Joydeep Sen Sarma wrote:
>> Thanks for replying.
>>
>> Can you please clarify - is it the case that the secondary namenode
>> stuff only works in 0.13.1? And what's the connection with the
>> replication factor?
>>
>> We lost the file system completely once, trying to make sure we can
>> avoid it the next time.
>
> I am not sure if the problem you reported still exists in 0.13.1. You
> might still have the problem and you can ask again. But you should
> move to 0.13.1 since it has some critical fixes. See the release notes
> for 0.13.1 or HADOOP-1603. You should always upgrade to the latest
> minor release version when moving to the next major version.
>
> Raghu.
>
>> Joydeep
>>
>> -----Original Message-----
>> From: Raghu Angadi [mailto:[EMAIL PROTECTED]
>> Sent: Thursday, August 23, 2007 9:44 PM
>> To: [email protected]
>> Subject: Re: secondary namenode errors
>>
>>
>> On a related note, please don't use 0.13.0; use the latest released
>> version for 0.13 (I think it is 0.13.1). If the secondary namenode
>> actually works, then it will result in all the replications being set
>> to 1.
>>
>> Raghu.
>>
>> Joydeep Sen Sarma wrote:
>>> Hi folks,

 
