I don't think the secondary namenodes were working at any point - so I'm
not sure they are a factor.

What I observed:

- stopped dfs; took a backup copy of the current/ directory
- restarted dfs with the new 0.13.1
- after the file system came back up, fsck reported the fs as corrupt; a
large number of files had missing blocks
- tried to stop, copy back the backed-up current/, and restart with
0.13.0 - no effect; the same corruption remained.
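For reference, the sequence above amounts to roughly the following (a sketch only - stop-dfs.sh/start-dfs.sh and "hadoop fsck" are the standard 0.13.x entry points, but the metadata path /var/hadoop/name is my assumption; the run wrapper just echoes each command, so the sketch is safe to execute anywhere):

```shell
# Dry-run sketch of the upgrade/rollback sequence described above.
# ASSUMPTION: dfs.name.dir is /var/hadoop/name - adjust for your site.
run() { echo "+ $*"; }   # echo-only wrapper; replace with eval to really run

run bin/stop-dfs.sh                                          # clean shutdown on 0.13.0
run cp -a /var/hadoop/name/current /var/hadoop/name/current.bak
run bin/start-dfs.sh                                         # first start on 0.13.1
run bin/hadoop fsck /                                        # this reported missing blocks
# attempted rollback (did not help - the corruption remained):
run bin/stop-dfs.sh
run rm -rf /var/hadoop/name/current
run cp -a /var/hadoop/name/current.bak /var/hadoop/name/current
run bin/start-dfs.sh                                         # back on 0.13.0
```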

The logs attached earlier show what happened - as soon as 0.13.1 was
brought up:

- The primary namenode seems to have gone through the edit log (I am
guessing). It spews out a large number of errors saying 'cannot rename
xxx to yyy, yyy already exists'.
- All the datanodes show a large number of 'deleting block bbb' messages.

Without knowing the source, the logs would indicate that the primary
went through the edit log and somehow managed to instruct the datanodes
to delete blocks, which caused the corruption.

Would be happy to send more logs your way (although the pattern remains
the same) if it helps.

Thanks,

Joydeep




-----Original Message-----
From: Konstantin Shvachko [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, August 28, 2007 1:39 PM
To: [email protected]
Subject: Re: secondary namenode errors

Could you please describe what exactly the problem with the upgrade is?
If a malfunctioning secondary name-node messes up the image and/or
edits files,
then we should fix the problem asap.

Thanks,
Konstantin

Joydeep Sen Sarma wrote:

>Just in case someone's curious.
>
> 
>
>Stop and restart dfs with 0.13.1:
>
> 
>
>- master name node says:
>
> 
>
>2007-08-24 18:31:27,318 INFO org.apache.hadoop.dfs.NameNode: Namenode
up
>at: hadoop001.sf2p.facebook.com/10.16.159.101:9000
>
>2007-08-24 18:31:28,560 WARN org.apache.hadoop.dfs.StateChange: DIR*
>FSDirectory.unprotectedDelete: failed to remove /tmp/pu3 because
>
> it does not exist
>
>2007-08-24 18:31:28,571 WARN org.apache.hadoop.dfs.StateChange: DIR*
>FSDirectory.unprotectedRenameTo: failed to rename /user/facebook
>
>/chatter/rawcounts/2007-08-04/_task_0001_r_000044_0/part-00044 to
>/user/facebook/chatter/rawcounts/2007-08-04/part-00044 because dest
>
>ination exists
>
>2007-08-24 18:31:28,571 WARN org.apache.hadoop.dfs.StateChange: DIR*
>FSDirectory.unprotectedRenameTo: failed to rename /user/facebook
>
>/chatter/rawcounts/2007-08-04/_task_0001_r_000044_0/.part-00044.crc to
>/user/facebook/chatter/rawcounts/2007-08-04/.part-00044.crc be
>
>cause destination exists
>
>2007-08-24 18:31:28,572 WARN org.apache.hadoop.dfs.StateChange: DIR*
>FSDirectory.unprotectedRenameTo: failed to rename /user/facebook
>
>/chatter/rawcounts/2007-08-04/_task_0001_r_000040_0/part-00040 to
>/user/facebook/chatter/rawcounts/2007-08-04/part-00040 because dest
>
>ination exists
>
>2007-08-24 18:31:28,572 WARN org.apache.hadoop.dfs.StateChange: DIR*
>FSDirectory.unprotectedRenameTo: failed to rename /user/facebook
>
>/chatter/rawcounts/2007-08-04/_task_0001_r_000040_0/.part-00040.crc to
>/user/facebook/chatter/rawcounts/2007-08-04/.part-00040.crc be
>
>cause destination exists
>
>2007-08-24 18:31:28,573 WARN org.apache.hadoop.dfs.StateChange: DIR*
>FSDirectory.unprotectedRenameTo: failed to rename /user/facebook
>
>/chatter/rawcounts/2007-08-04/_task_0001_r_000052_0/part-00052 to
>/user/facebook/chatter/rawcounts/2007-08-04/part-00052 because dest
>
>ination exists
>
>...
>
> 
>
>There's a serious blast of these (replaying the edit log?). In any case -
>after this is done - it enters safemode - I presume the fs is corrupted by
>then. At the exact same time, the datanodes are busy deleting blocks!:
>
> 
>
>2007-08-24 18:31:33,243 INFO org.apache.hadoop.dfs.DataNode: Starting
>DataNode in: FSDataset{dirpath='/var/hadoop/tmp/dfs/data/curren
>
>t'}
>
>2007-08-24 18:31:33,243 INFO org.apache.hadoop.dfs.DataNode: using
>BLOCKREPORT_INTERVAL of 3588023msec
>
>2007-08-24 18:31:34,252 INFO org.apache.hadoop.dfs.DataNode: Deleting
>block blk_-9223045762536565560 file /var/hadoop/tmp/dfs/data/cu
>
>rrent/subdir14/subdir18/blk_-9223045762536565560
>
>2007-08-24 18:31:34,269 INFO org.apache.hadoop.dfs.DataNode: Deleting
>block blk_-9214178286744587840 file /var/hadoop/tmp/dfs/data/cu
>
>rrent/subdir14/subdir12/blk_-9214178286744587840
>
>2007-08-24 18:31:34,370 INFO org.apache.hadoop.dfs.DataNode: Deleting
>block blk_-9213127144044535407 file /var/hadoop/tmp/dfs/data/cu
>
>rrent/subdir14/subdir20/blk_-9213127144044535407
>
>2007-08-24 18:31:34,386 INFO org.apache.hadoop.dfs.DataNode: Deleting
>block blk_-9211625398030978419 file /var/hadoop/tmp/dfs/data/cu
>
>rrent/subdir14/subdir26/blk_-9211625398030978419
>
>2007-08-24 18:31:34,418 INFO org.apache.hadoop.dfs.DataNode: Deleting
>block blk_-9189558923884323865 file /var/hadoop/tmp/dfs/data/cu
>
>rrent/subdir14/subdir24/blk_-9189558923884323865
>
>2007-08-24 18:31:34,419 INFO org.apache.hadoop.dfs.DataNode: Deleting
>block blk_-9115468136273900585 file /var/hadoop/tmp/dfs/data/cu
>
>rrent/subdir10/blk_-9115468136273900585
>
> 
>
> 
>
>Ouch - I guess those are all the blocks that fsck is now reporting as
>missing. Known bug? Operator error? (Well - I did do a clean
>shutdown...)
>
> 
>
> 
>
>-----Original Message-----
>From: Joydeep Sen Sarma [mailto:[EMAIL PROTECTED] 
>Sent: Friday, August 24, 2007 7:21 PM
>To: [email protected]
>Subject: RE: secondary namenode errors
>
> 
>
>I wish I had read the bug more carefully - I thought that the issue was
>fixed in 0.13.1.
>
>Of course not; the issue persists. Meanwhile - half the files are
>corrupted after the upgrade (I followed the upgrade wiki and tried to
>restore the backed-up metadata and the old version - to no avail).
>
>Sigh - have a nice weekend everyone,
>
> 
>
>Joydeep
>
> 
>
>-----Original Message-----
>From: Koji Noguchi [mailto:[EMAIL PROTECTED] 
>Sent: Friday, August 24, 2007 8:29 AM
>To: [email protected]
>Subject: Re: secondary namenode errors
>
>Joydeep,
>
>I think you're hitting this bug:
>http://issues.apache.org/jira/browse/HADOOP-1076
>
>In any case, as Raghu suggested, please use 0.13.1 and not 0.13.
>
>Koji
>
>
>Raghu Angadi wrote:
>
>>Joydeep Sen Sarma wrote:
>>
>>>Thanks for replying.
>>>
>>>Can you please clarify - is it the case that the secondary namenode
>>>stuff only works in 0.13.1? And what's the connection with the
>>>replication factor?
>>>
>>>We lost the file system completely once; trying to make sure we can
>>>avoid it the next time.
>>
>>I am not sure if the problem you reported still exists in 0.13.1. You
>>might still have the problem and you can ask again. But you should
>>move to 0.13.1 since it has some critical fixes. See the release notes
>>for 0.13.1 or HADOOP-1603. You should always upgrade to the latest
>>minor release version when moving to the next major version.
>>
>>Raghu.
>>
>>>Joydeep
>>>
>>>-----Original Message-----
>>>From: Raghu Angadi [mailto:[EMAIL PROTECTED] 
>>>Sent: Thursday, August 23, 2007 9:44 PM
>>>To: [email protected]
>>>Subject: Re: secondary namenode errors
>>>
>>>On a related note, please don't use 0.13.0; use the latest released
>>>version for 0.13 (I think it is 0.13.1). If the secondary namenode
>>>actually works, then it will result in all the replications being set
>>>to 1.
>>>
>>>Raghu.
>>>
>>>Joydeep Sen Sarma wrote:
>>>
>>>>Hi folks,
