[ 
https://issues.apache.org/jira/browse/HADOOP-9608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13671017#comment-13671017
 ] 

Todd Lipcon commented on HADOOP-9608:
-------------------------------------

bq. Do you mean because NNA's ZKFC does not have C's address or there is a 
fixed mapping of the ith NN to a server and that mapping is stale?

Right -- in this case even the namenode ID changed, so NNA's ZKFC didn't know 
what to do with the information in ZK which said it should go and fence the old 
namenode ID.

bq. The proposed solution is to make every standby ZKFC restart when it 
discovers an active leader that cannot be connected to? This would mean all 
standby NN would be rebooting in the above scenario when C becomes master, 
right?

Right, though the standby NNs themselves wouldn't restart: their ZKFCs would 
abort and require the admin to reconfigure and restart them. The general idea 
is that we shouldn't have a ZKFC in the election if, upon winning, it would 
fail to become active anyway.
                
> ZKFC should abort if it sees an unrecognized NN become active
> -------------------------------------------------------------
>
>                 Key: HADOOP-9608
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9608
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 3.0.0
>            Reporter: Todd Lipcon
>
> We recently had an issue where one NameNode and its ZKFC were updated to a 
> new configuration/IP address but the ZKFC on the other node was not 
> restarted. Then, the next time a failover occurred, the second ZKFC was not 
> able to become active because the data in the ActiveBreadCrumb didn't match 
> the data in its own configuration:
> {code}
> org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
> java.lang.IllegalArgumentException: Unable to determine service address for namenode 'XXXX'
> {code}
> To prevent this from happening, whenever the ZKFC sees a new NN become 
> active, it should check that it can instantiate a ServiceTarget for that NN, 
> and if not, abort (since this ZKFC wouldn't be able to handle a failover 
> successfully).
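
A rough sketch of the check described above, to make the idea concrete. This is not the actual ZKFC code and does not use Hadoop's ServiceTarget API: the class ActiveNodeCheck, its methods, the nn1/nn2/nn3 IDs, and the example host names are all hypothetical, and the local configuration is modeled as a plain map from namenode ID to "host:port". The assumption is that an active namenode ID missing from the local configuration is exactly the case that produced the IllegalArgumentException in the log.

```java
import java.net.InetSocketAddress;
import java.util.Map;

// Hypothetical sketch: names here are illustrative, not Hadoop's real ZKFC classes.
class ActiveNodeCheck {
    // Stand-in for this ZKFC's local configuration: namenode ID -> "host:port".
    private final Map<String, String> configuredAddresses;

    ActiveNodeCheck(Map<String, String> configuredAddresses) {
        this.configuredAddresses = configuredAddresses;
    }

    // Resolves a namenode ID against the local configuration, failing the same
    // way as the logged error when the ID is unknown.
    InetSocketAddress resolve(String namenodeId) {
        String hostPort = configuredAddresses.get(namenodeId);
        if (hostPort == null) {
            throw new IllegalArgumentException(
                "Unable to determine service address for namenode '" + namenodeId + "'");
        }
        int colon = hostPort.lastIndexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("Malformed address: " + hostPort);
        }
        // NumberFormatException is a subclass of IllegalArgumentException.
        int port = Integer.parseInt(hostPort.substring(colon + 1));
        return InetSocketAddress.createUnresolved(hostPort.substring(0, colon), port);
    }

    // True if this controller could build a target for the newly active node,
    // i.e. it could fence that node in a later failover. False means "abort".
    boolean canHandleActive(String activeNamenodeId) {
        try {
            resolve(activeNamenodeId);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        ActiveNodeCheck check = new ActiveNodeCheck(Map.of(
            "nn1", "host1.example.com:8020",
            "nn2", "host2.example.com:8020"));
        for (String id : new String[] {"nn1", "nn3"}) {
            if (check.canHandleActive(id)) {
                System.out.println(id + ": recognized, staying in the election");
            } else {
                System.out.println(id + ": unrecognized, this ZKFC should abort");
            }
        }
    }
}
```

The point of running the check when a new active node appears, rather than when this controller wins an election, is that the stale configuration gets surfaced to the admin before a failover depends on it.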
