Could you please describe what exactly the problem with the upgrade is?
If a malfunctioning secondary namenode messes up the image and/or edits files,
then we should fix the problem ASAP.

Thanks,
Konstantin

Joydeep Sen Sarma wrote:

Just in case someone's curious.



Stop and restart dfs with 0.13.1:



- master name node says:



2007-08-24 18:31:27,318 INFO org.apache.hadoop.dfs.NameNode: Namenode up at: hadoop001.sf2p.facebook.com/10.16.159.101:9000
2007-08-24 18:31:28,560 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedDelete: failed to remove /tmp/pu3 because it does not exist
2007-08-24 18:31:28,571 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /user/facebook/chatter/rawcounts/2007-08-04/_task_0001_r_000044_0/part-00044 to /user/facebook/chatter/rawcounts/2007-08-04/part-00044 because destination exists
2007-08-24 18:31:28,571 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /user/facebook/chatter/rawcounts/2007-08-04/_task_0001_r_000044_0/.part-00044.crc to /user/facebook/chatter/rawcounts/2007-08-04/.part-00044.crc because destination exists
2007-08-24 18:31:28,572 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /user/facebook/chatter/rawcounts/2007-08-04/_task_0001_r_000040_0/part-00040 to /user/facebook/chatter/rawcounts/2007-08-04/part-00040 because destination exists
2007-08-24 18:31:28,572 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /user/facebook/chatter/rawcounts/2007-08-04/_task_0001_r_000040_0/.part-00040.crc to /user/facebook/chatter/rawcounts/2007-08-04/.part-00040.crc because destination exists
2007-08-24 18:31:28,573 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /user/facebook/chatter/rawcounts/2007-08-04/_task_0001_r_000052_0/part-00052 to /user/facebook/chatter/rawcounts/2007-08-04/part-00052 because destination exists

...



There's a serious blast of these (replaying the edit log?). In any case, after this is done it enters safemode - I presume the fs is corrupted by then. At the exact same time, the datanodes are busy deleting blocks!:



2007-08-24 18:31:33,243 INFO org.apache.hadoop.dfs.DataNode: Starting DataNode in: FSDataset{dirpath='/var/hadoop/tmp/dfs/data/current'}
2007-08-24 18:31:33,243 INFO org.apache.hadoop.dfs.DataNode: using BLOCKREPORT_INTERVAL of 3588023msec
2007-08-24 18:31:34,252 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-9223045762536565560 file /var/hadoop/tmp/dfs/data/current/subdir14/subdir18/blk_-9223045762536565560
2007-08-24 18:31:34,269 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-9214178286744587840 file /var/hadoop/tmp/dfs/data/current/subdir14/subdir12/blk_-9214178286744587840
2007-08-24 18:31:34,370 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-9213127144044535407 file /var/hadoop/tmp/dfs/data/current/subdir14/subdir20/blk_-9213127144044535407
2007-08-24 18:31:34,386 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-9211625398030978419 file /var/hadoop/tmp/dfs/data/current/subdir14/subdir26/blk_-9211625398030978419
2007-08-24 18:31:34,418 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-9189558923884323865 file /var/hadoop/tmp/dfs/data/current/subdir14/subdir24/blk_-9189558923884323865
2007-08-24 18:31:34,419 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-9115468136273900585 file /var/hadoop/tmp/dfs/data/current/subdir10/blk_-9115468136273900585





ouch - I guess those are all the blocks that fsck is now reporting missing. Known bug? Operator error? (well - I did do a clean shutdown ..)
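
For what it's worth, the quickest way to see what the namenode thinks of the damage at this point is the safemode status plus fsck. A minimal sketch, assuming the stock 0.13-era CLI (the path here is just illustrative):

# is the namenode still sitting in safemode?
bin/hadoop dfsadmin -safemode get

# list files with missing or under-replicated blocks
bin/hadoop fsck / -files -blocks -locations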





-----Original Message-----
From: Joydeep Sen Sarma [mailto:[EMAIL PROTECTED]
Sent: Friday, August 24, 2007 7:21 PM
To: [email protected]
Subject: RE: secondary namenode errors



I wish I had read the bug more carefully - I thought the issue was fixed in 0.13.1.

Of course not; the issue persists. Meanwhile, half the files are corrupted after the upgrade (I followed the upgrade wiki and tried restoring the backed-up metadata and the old version - to no avail).



Sigh - have a nice weekend everyone,



Joydeep
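
For reference, the manual restore attempt above boils down to roughly the following - a sketch only, assuming dfs.name.dir is /var/hadoop/name and the backup is a straight copy of that directory taken before the upgrade (both paths are illustrative):

# stop dfs before touching any metadata
bin/stop-dfs.sh

# put the pre-upgrade namenode metadata back
rm -rf /var/hadoop/name
cp -r /backup/name.pre-upgrade /var/hadoop/name

# restart with the old binaries and check the damage
bin/start-dfs.sh
bin/hadoop fsck /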



-----Original Message-----

From: Koji Noguchi [mailto:[EMAIL PROTECTED]
Sent: Friday, August 24, 2007 8:29 AM

To: [email protected]

Subject: Re: secondary namenode errors



Joydeep,



I think you're hitting this bug.

http://issues.apache.org/jira/browse/HADOOP-1076



In any case, as Raghu suggested, please use 0.13.1 and not 0.13.



Koji









Raghu Angadi wrote:

Joydeep Sen Sarma wrote:

Thanks for replying.


Can you please clarify - is it the case that the secondary namenode stuff only works in 0.13.1? And what's the connection with the replication factor?

We lost the file system completely once; we're trying to make sure we can avoid it the next time.


I am not sure if the problem you reported still exists in 0.13.1. You might still have the problem, and you can ask again. But you should move to 0.13.1 since it has some critical fixes. See the release notes for 0.13.1 or HADOOP-1603. You should always upgrade to the latest minor release version when moving to the next major version.


Raghu.
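
Worth adding for anyone doing this later: whatever the version jump, snapshot the namenode metadata while dfs is down and fsck before resuming jobs. A minimal sketch, assuming dfs.name.dir is /var/hadoop/name (illustrative path):

# with dfs stopped, back up the namenode metadata (fsimage + edits)
bin/stop-dfs.sh
cp -r /var/hadoop/name /backup/name.pre-upgrade

# bring the new version up and verify before letting jobs back in
bin/start-dfs.sh
bin/hadoop fsck /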


Joydeep


-----Original Message-----

From: Raghu Angadi [mailto:[EMAIL PROTECTED]
Sent: Thursday, August 23, 2007 9:44 PM

To: [email protected]

Subject: Re: secondary namenode errors



On a related note, please don't use 0.13.0; use the latest released version for 0.13 (I think it is 0.13.1). If the secondary namenode actually works, then it will result in all the replications being set to 1.


Raghu.
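
If that bug does bite, the replication can at least be pushed back up afterwards. A sketch, assuming setrep is available in this release and that 3 is the intended replication factor (the path is illustrative, borrowed from the logs above):

# check what fsck reports for replication on the affected tree
bin/hadoop fsck /user/facebook -files

# recursively restore the intended replication factor
bin/hadoop dfs -setrep -R 3 /user/facebook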


Joydeep Sen Sarma wrote:

Hi folks,




