[ 
https://issues.apache.org/jira/browse/HDFS-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984611#comment-13984611
 ] 

Todd Lipcon commented on HDFS-6289:
-----------------------------------

Can you double-check that this patch doesn't make this test flakier? I've 
seen it fail once or twice before in precommit runs, but since it's closely 
related to the code this patch touches, we should probably investigate a bit 
before committing.

Otherwise +1

> HA failover can fail if there are pending DN messages for DNs which no longer 
> exist
> -----------------------------------------------------------------------------------
>
>                 Key: HDFS-6289
>                 URL: https://issues.apache.org/jira/browse/HDFS-6289
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.4.0
>            Reporter: Aaron T. Myers
>            Assignee: Aaron T. Myers
>            Priority: Critical
>         Attachments: HDFS-6289.patch, HDFS-6289.patch
>
>
> In an HA setup, the standby NN may receive messages from DNs for blocks which 
> the standby NN is not yet aware of. It queues up these messages and replays 
> them when it next reads from the edit log or fails over. On a failover, all 
> of these pending DN messages must be processed successfully in order for the 
> failover to succeed. If one of these pending DN messages refers to a DN 
> storageId that no longer exists (because the DN with that transfer address 
> has been reformatted and has re-registered with the same transfer address) 
> then on transition to active the NN will not be able to process this DN 
> message and will suicide with an error like the following:
> {noformat}
> 2014-04-25 14:23:17,922 FATAL namenode.NameNode 
> (NameNode.java:doImmediateShutdown(1525)) - Error encountered requiring NN 
> shutdown. Shutting down immediately.
> java.io.IOException: Cannot mark 
> blk_1073741825_900(stored=blk_1073741825_1001) as corrupt because datanode 
> 127.0.0.1:33324 does not exist
> {noformat}
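
For readers following the description above, here is a minimal Java sketch of the failure mode. It is not the Hadoop code and not the attached patch: the class PendingMessageSketch, its methods, and the purgePending flag are all hypothetical names made up for illustration. It assumes a queue of pending messages keyed by storage ID, a replay step that aborts on any message whose storage is unknown, and one possible mitigation (dropping a storage's queued messages when the storage is removed).

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in Hadoop.
class PendingMessageSketch {

    // A queued datanode message: which storage sent it, about which block.
    static final class PendingMessage {
        final String storageId;
        final String blockId;

        PendingMessage(String storageId, String blockId) {
            this.storageId = storageId;
            this.blockId = blockId;
        }
    }

    private final Set<String> liveStorageIds = new HashSet<>();
    private final Queue<PendingMessage> pending = new ArrayDeque<>();

    void register(String storageId) {
        liveStorageIds.add(storageId);
    }

    void enqueue(String storageId, String blockId) {
        pending.add(new PendingMessage(storageId, blockId));
    }

    // Forget a storage. With purgePending == false, its queued messages
    // stay behind and become stale, which is the situation described above.
    void removeStorage(String storageId, boolean purgePending) {
        liveStorageIds.remove(storageId);
        if (purgePending) {
            pending.removeIf(m -> m.storageId.equals(storageId));
        }
    }

    // Replay every queued message. Any message that refers to an unknown
    // storage aborts the replay, mimicking the failed transition to active.
    int replayAll() {
        int processed = 0;
        PendingMessage m;
        while ((m = pending.poll()) != null) {
            if (!liveStorageIds.contains(m.storageId)) {
                throw new IllegalStateException("Cannot process " + m.blockId
                        + " because storage " + m.storageId + " does not exist");
            }
            processed++;
        }
        return processed;
    }
}
```

In this sketch, calling removeStorage("old", false) after enqueueing a message for "old" makes the next replayAll() throw, while removeStorage("old", true) lets the replay complete. Whether the real fix purges at removal time or skips stale messages at replay time is a design choice this sketch does not settle.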



--
This message was sent by Atlassian JIRA
(v6.2#6252)
