[ https://issues.apache.org/jira/browse/HADOOP-4402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brian Bockelman resolved HADOOP-4402.
-------------------------------------

    Resolution: Invalid

> Namenode is unaware of FS corruption
> ------------------------------------
>
>                 Key: HADOOP-4402
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4402
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.18.1
>            Reporter: Brian Bockelman
>             Fix For: 0.18.2
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> I think the namenode is not told when there is block corruption.
> I found a huge number of files reported as corrupt when I restarted my namenode.
> Digging through the datanode logs, I saw the following:
> 2008-10-13 03:30:44,266 INFO org.apache.hadoop.dfs.DataBlockScanner: Reporting bad block blk_-54103619973430645_3038 to namenode.
> 2008-10-13 03:57:11,447 WARN org.apache.hadoop.dfs.DataBlockScanner: First Verification failed for blk_7657563767222456337_3165. Exception : java.io.IOException: Block blk_7657563767222456337_3165 is not valid.
>       at org.apache.hadoop.dfs.FSDataset.getBlockFile(FSDataset.java:716)
>       at org.apache.hadoop.dfs.FSDataset.getLength(FSDataset.java:704)
>       at org.apache.hadoop.dfs.DataNode$BlockSender.<init>(DataNode.java:1678)
>       at org.apache.hadoop.dfs.DataBlockScanner.verifyBlock(DataBlockScanner.java:408)
>       at org.apache.hadoop.dfs.DataBlockScanner.verifyFirstBlock(DataBlockScanner.java:474)
>       at org.apache.hadoop.dfs.DataBlockScanner.run(DataBlockScanner.java:565)
>       at java.lang.Thread.run(Thread.java:595)
> 2008-10-13 03:57:11,448 WARN org.apache.hadoop.dfs.DataBlockScanner: Second Verification failed for blk_7657563767222456337_3165. Exception : java.io.IOException: Block blk_7657563767222456337_3165 is not valid.
>       at org.apache.hadoop.dfs.FSDataset.getBlockFile(FSDataset.java:716)
>       at org.apache.hadoop.dfs.FSDataset.getLength(FSDataset.java:704)
>       at org.apache.hadoop.dfs.DataNode$BlockSender.<init>(DataNode.java:1678)
>       at org.apache.hadoop.dfs.DataBlockScanner.verifyBlock(DataBlockScanner.java:408)
>       at org.apache.hadoop.dfs.DataBlockScanner.verifyFirstBlock(DataBlockScanner.java:474)
>       at org.apache.hadoop.dfs.DataBlockScanner.run(DataBlockScanner.java:565)
>       at java.lang.Thread.run(Thread.java:595)
> 2008-10-13 03:57:11,448 INFO org.apache.hadoop.dfs.DataBlockScanner: Reporting bad block blk_7657563767222456337_3165 to namenode.
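> (The pattern in those lines reads to me as: verify the replica, retry once after the first failure, and only report the block to the namenode after the second failure. A throwaway sketch of that flow is below; verifyReplica() and reportBadBlock() are placeholder names for illustration, not actual Hadoop methods.)
>
> // Placeholder sketch of the "verify twice, then report" pattern the log lines
> // above suggest; nothing here is a real Hadoop API.
> public class ScanSketch {
>   // stand-in for reading the replica back and re-checking its checksums
>   static boolean verifyReplica(String blockId) { return false; }
>
>   // stand-in for the "Reporting bad block ... to namenode." step
>   static void reportBadBlock(String blockId) {
>     System.out.println("Reporting bad block " + blockId + " to namenode.");
>   }
>
>   static void scan(String blockId) {
>     if (verifyReplica(blockId)) return;   // "First Verification failed ..."
>     if (verifyReplica(blockId)) return;   // "Second Verification failed ..."
>     reportBadBlock(blockId);              // only after two consecutive failures
>   }
>
>   public static void main(String[] args) {
>     scan("blk_7657563767222456337_3165");
>   }
> }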
> So, node099 found a bad block.  However, if I grep the namenode logs for that block:
> /scratch/hadoop/logs/hadoop-root-namenode-hadoop-name.log.2008-10-10:2008-10-10 20:21:20,002 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.allocateBlock: /user/uscms01/LoadTestDownload/LoadTest07_FNAL_01_MyQXiu5a22TJQlcB_508. blk_7657563767222456337_3165
> /scratch/hadoop/logs/hadoop-root-namenode-hadoop-name.log.2008-10-10:2008-10-10 20:21:32,150 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 172.16.1.110:50010 is added to blk_7657563767222456337_3165 size 67108864
> /scratch/hadoop/logs/hadoop-root-namenode-hadoop-name.log.2008-10-10:2008-10-10 20:21:32,151 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 172.16.1.99:50010 is added to blk_7657563767222456337_3165 size 67108864
> /scratch/hadoop/logs/hadoop-root-namenode-hadoop-name.log.2008-10-12:2008-10-12 05:05:26,898 INFO org.apache.hadoop.dfs.StateChange: BLOCK* ask 172.16.1.99:50010 to replicate blk_7657563767222456337_3165 to datanode(s) 172.16.1.18:50010
> /scratch/hadoop/logs/hadoop-root-namenode-hadoop-name.log.2008-10-12:2008-10-12 05:05:40,742 INFO org.apache.hadoop.dfs.NameNode: Error report from 172.16.1.99:50010: Can't send invalid block blk_7657563767222456337_3165
> /scratch/hadoop/logs/hadoop-root-namenode-hadoop-name.log.2008-10-12:2008-10-12 05:12:43,759 WARN org.apache.hadoop.fs.FSNamesystem: PendingReplicationMonitor timed out block blk_7657563767222456337_3165
> To summarize:
> - The block is allocated and written successfully to node110, then replicated to node099.
> - The namenode asks node099 to replicate the block to node018.
> - The namenode is told by node099 that it can't send the invalid block to node018!  A few minutes later, the PendingReplicationMonitor times out.
> - No new replications are launched!!!
> - The block is found to be corrupted on node099 a few days later.  The datanode claims to inform the namenode of this, but nothing is listed in the namenode logs.
> - The block is suspiciously missing on node110 as well.
> Perhaps there are a few bugs here?
> 1) The namenode doesn't get notified of the corrupted blocks, even though the datanode claims to report them!
> 2) On replication failure, no new replicas are created.
> 3) Corruption events are much, much too common.  We have a specific dataset which the namenode now claims is mostly corrupted (100/167 files); before I restarted the namenode, we had jobs run against it continuously for the entire weekend.  The jobs were all successful, and the binary data format does internal integrity checks as it reads the files.  If the corruption were real, the jobs would have failed.  (A simple read-back check along these lines is sketched below.)
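> Here is the kind of read-back check I mean, as a rough sketch only: open each file in the dataset through the stock HDFS client (org.apache.hadoop.fs.FileSystem) and read it to the end, so the client's own checksum verification runs.  The class name ReadCheck, the buffer size, and the output format are just illustration, not anything from the Hadoop code base.
>
> import java.io.IOException;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataInputStream;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> // Rough sketch: read every file under a directory through the HDFS client so
> // its checksum verification runs, and report any file that fails to read.
> public class ReadCheck {
>   public static void main(String[] args) throws IOException {
>     FileSystem fs = FileSystem.get(new Configuration());
>     FileStatus[] files = fs.listStatus(new Path(args[0]));  // e.g. the dataset directory
>     if (files == null) return;
>     byte[] buf = new byte[65536];
>     for (FileStatus stat : files) {
>       if (stat.isDir()) continue;
>       FSDataInputStream in = null;
>       try {
>         in = fs.open(stat.getPath());
>         while (in.read(buf) >= 0) { }   // a full read means the checksums matched
>         System.out.println("OK  " + stat.getPath());
>       } catch (IOException ioe) {       // ChecksumException is an IOException
>         System.out.println("BAD " + stat.getPath() + " : " + ioe.getMessage());
>       } finally {
>         if (in != null) in.close();
>       }
>     }
>   }
> }
>
> If reads like that succeed for all 167 files while the namenode still claims they are corrupt, that would point at the reporting path rather than the data itself.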
> I'm concerned that the corruption-detection systems of Hadoop are seriously 
> busted for me. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
