[ https://issues.apache.org/jira/browse/HADOOP-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654938#action_12654938 ]

Hairong Kuang commented on HADOOP-4810:
---------------------------------------

When the invalid replicas (the ones from A1.A2.A3.A4 and B1.B2.B3.B4) were
reported, the NN was able to detect that they were corrupted and tried to mark
them as invalid by adding them to recentInvalidates. Those attempts failed
because of SafeMode, however, and the replicas were left in blocksMap. When the
NN came out of SafeMode, the corrupted replicas were counted as valid ones and
the block was treated as over-replicated. Unfortunately, when the excess
replicas were removed, the valid replicas (the ones from C1.C2.C3.C4 and
D1.D2.D3.D4) were the ones chosen, so valid data got lost.
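
Roughly, the failure sequence can be sketched like this. This is a simplified,
self-contained simulation, not the actual FSNamesystem code; the class and
method names here are only illustrative:

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SafeModeDataLossSketch {

    // Replicas the NN knows about for one block, keyed by datanode address.
    final Map<String, Long> blocksMap = new HashMap<>();
    // Replicas queued for deletion.
    final Set<String> recentInvalidates = new HashSet<>();
    boolean safeMode = true;
    final long expectedSize = 9303872L;

    // A block report arrives for one replica.
    void reportReplica(String node, long reportedSize) {
        blocksMap.put(node, reportedSize);
        if (reportedSize != expectedSize) {
            try {
                invalidateBlock(node);
            } catch (IllegalStateException e) {
                // The SafeModeException analogue: invalidation fails, but the
                // corrupt replica is left behind in blocksMap -- the bug.
                System.out.println("Cannot invalidate on " + node
                        + ": name node is in safe mode");
            }
        }
    }

    void invalidateBlock(String node) {
        if (safeMode) {
            throw new IllegalStateException("Name node is in safe mode");
        }
        recentInvalidates.add(node);
        blocksMap.remove(node);
    }

    // After leaving safe mode, every replica still in blocksMap is counted as
    // valid, so with 4 entries and replication 2 the block looks
    // over-replicated, and the excess-replica choice may pick good copies.
    void processOverReplication(int replication) {
        safeMode = false;
        List<String> nodes = new ArrayList<>(blocksMap.keySet());
        while (nodes.size() > replication) {
            String victim = nodes.remove(nodes.size() - 1);
            blocksMap.remove(victim);
            System.out.println("ask " + victim + " to delete block");
        }
    }

    public static void main(String[] args) {
        SafeModeDataLossSketch nn = new SafeModeDataLossSketch();
        nn.reportReplica("C1.C2.C3.C4:50010", 9303872L); // valid
        nn.reportReplica("D1.D2.D3.D4:50010", 9303872L); // valid
        nn.reportReplica("A1.A2.A3.A4:50010", 262144L);  // corrupt (short)
        nn.reportReplica("B1.B2.B3.B4:50010", 306688L);  // corrupt (short)
        nn.processOverReplication(2); // the valid replicas may be deleted
    }
}
{code}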

It seems to me that when corrupted replicas are first detected in a block
report, they should be put in CorruptedBlockMap instead of being invalidated
directly. Then at least the NN would not count them as valid replicas, so this
false over-replication would not happen.
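
A minimal sketch of what I have in mind, under the same illustrative names as
above (again, not the real FSNamesystem or CorruptedBlockMap code):

{code:java}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class CorruptReplicaSketch {

    final Map<String, Long> blocksMap = new HashMap<>();
    // Replicas known to be corrupt; actual deletion is deferred until the
    // NN is out of safe mode.
    final Set<String> corruptReplicas = new HashSet<>();
    final long expectedSize = 9303872L;

    void reportReplica(String node, long reportedSize) {
        blocksMap.put(node, reportedSize);
        if (reportedSize != expectedSize) {
            // Record the corruption instead of invalidating right away.
            corruptReplicas.add(node);
        }
    }

    // Only non-corrupt replicas count as live, so the block is never seen
    // as over-replicated because of its corrupt copies.
    int countLiveReplicas() {
        int live = 0;
        for (String node : blocksMap.keySet()) {
            if (!corruptReplicas.contains(node)) {
                live++;
            }
        }
        return live;
    }

    public static void main(String[] args) {
        CorruptReplicaSketch nn = new CorruptReplicaSketch();
        nn.reportReplica("C1.C2.C3.C4:50010", 9303872L); // valid
        nn.reportReplica("D1.D2.D3.D4:50010", 9303872L); // valid
        nn.reportReplica("A1.A2.A3.A4:50010", 262144L);  // corrupt
        nn.reportReplica("B1.B2.B3.B4:50010", 306688L);  // corrupt
        // Prints 2: no false over-replication, and no valid replica is lost.
        System.out.println("live replicas: " + nn.countLiveReplicas());
    }
}
{code}

Since the corrupt replicas never enter the live count, chooseExcessReplicates
would never be triggered by them, even if invalidation has to wait for SafeMode
to end.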

> Data lost at cluster startup time
> ---------------------------------
>
>                 Key: HADOOP-4810
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4810
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.18.2
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>            Priority: Blocker
>             Fix For: 0.18.3
>
>
> hadoop dfs -cat file1 returns
> dfs.DFSClient: Could not obtain block blk_XX_0 from any node: java.io.IOException: No live nodes contain current block
> Tracing the history of the block from the NN log, we found
> WARN org.apache.hadoop.fs.FSNamesystem: Inconsistent size for block blk_-6160940519231606858_0 reported from A1.A2.A3.A4:50010 current size is 9303872 reported size is 262144
> WARN org.apache.hadoop.fs.FSNamesystem: Deleting block blk_-6160940519231606858_0 from A1.A2.A3.A4:50010
> INFO org.apache.hadoop.dfs.StateChange: DIR* NameSystem.invalidateBlock: blk_-6160940519231606858_0 on A1.A2.A3.A4:50010
> WARN org.apache.hadoop.fs.FSNamesystem: Error in deleting bad block blk_-6160940519231606858_0 org.apache.hadoop.dfs.SafeModeException: Cannot invalidate block blk_-6160940519231606858_0. Name node is in safe mode.
> WARN org.apache.hadoop.fs.FSNamesystem: Inconsistent size for block blk_-6160940519231606858_0 reported from B1.B2.B3.B4:50010 current size is 9303872 reported size is 306688
> WARN org.apache.hadoop.fs.FSNamesystem: Deleting block blk_-6160940519231606858_0 from B1.B2.B3.B4:50010
> INFO org.apache.hadoop.dfs.StateChange: DIR* NameSystem.invalidateBlock: blk_-6160940519231606858_0 on B1.B2.B3.B4:50010
> WARN org.apache.hadoop.fs.FSNamesystem: Error in deleting bad block blk_-6160940519231606858_0 org.apache.hadoop.dfs.SafeModeException: Cannot invalidate block blk_-6160940519231606858_0. Name node is in safe mode.
> INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.chooseExcessReplicates: (C1.C2.C3.C4:50010, blk_-6160940519231606858_0) is added to recentInvalidateSets
> INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.chooseExcessReplicates: (D1.D2.D3.D4:50010, blk_-6160940519231606858_0) is added to recentInvalidateSets
> INFO org.apache.hadoop.dfs.StateChange: BLOCK* ask C1.C2.C3.C4:50010 to delete blk_-6160940519231606858_0
> INFO org.apache.hadoop.dfs.StateChange: BLOCK* ask D1.D2.D3.D4:50010 to delete blk_-6160940519231606858_0

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
