[ 
https://issues.apache.org/jira/browse/AMBARI-13396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955510#comment-14955510
 ] 

Alejandro Fernandez commented on AMBARI-13396:
----------------------------------------------

The current logic is as follows. Assume initially:
* NameNode 1 is active
* NameNode 2 is standby

Ambari will schedule the restart to run on NameNode 2 first, then on NameNode 1.

1. Sunny case: both NameNodes are up.
* We hit the standby NN2 first and try to restart it. The "stop" function calls 
initiate_safe_zkfc_failover(), which waits for the active NameNode to become 
the standby (this is the bug, since that transition is not always possible).
* On NN2, it determines that it is the standby, so this is a no-op.
* On NN1, it determines that it is the active, so it initiates a failover and 
waits for itself to become the standby.

2. NameNode 1 dies:
* We hit NN2 first, which is now the active, and try to restart it. The "stop" 
function calls initiate_safe_zkfc_failover(), which issues the failover command 
(and, if that fails, kills ZKFC), and then waits for this NameNode to become 
the standby (the wait loop is sketched after these scenarios). *However, NN1 is 
down and cannot take over as the active, so NN2 can never become the standby.*
* On NN1, the stop command calls initiate_safe_zkfc_failover(), which fails 
since this NameNode is down.

3. NameNode 2 dies:
* On NN2, the stop command calls initiate_safe_zkfc_failover(), which *fails* 
since this NameNode is down. To proceed, the user must start this NameNode, 
which will come back up as the *standby*.
* On NN1, which is the active, the stop command calls 
initiate_safe_zkfc_failover(), and the failover succeeds.

4. Both die:
* On NN2, the stop command calls initiate_safe_zkfc_failover(), which *fails* 
since this NameNode is down. To proceed, the user must start this NameNode 
manually, and it will come back up as the *active*. Once they retry, NN1 is 
still down, so NN2 will stay the active.
* On NN1, which is down, the stop command calls initiate_safe_zkfc_failover() 
and *fails* to determine the HA state.
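
For reference, the waiting step that hangs in scenario 2 boils down to polling 
"hdfs haadmin -getServiceState" until it reports standby. A minimal standalone 
sketch of that loop (plain Python subprocess calls rather than Ambari's Execute 
resource; the retry counts match the tries=50/try_sleep=6 seen in the log below):

{noformat}
# Minimal sketch of the "wait for standby" step (standalone Python; the real
# script uses Ambari's Execute resource, as shown in the log below).
import subprocess
import time

def wait_until_standby(namenode_id, tries=50, try_sleep=6):
    """Poll 'hdfs haadmin -getServiceState <id>' until it reports standby.

    If the other NameNode is down, this NameNode can never give up the
    active role, so the loop exhausts every retry and the stop step fails.
    """
    cmd = "hdfs haadmin -getServiceState %s | grep standby" % namenode_id
    for _ in range(tries):
        if subprocess.call(cmd, shell=True) == 0:
            return True
        time.sleep(try_sleep)
    return False
{noformat}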

Generally, we need to make the failover a best attempt and, when it succeeds, 
wait for this NameNode's state to change from active to standby only if the 
other NameNode is healthy and can take over as the active.

In initiate_safe_zkfc_failover, if the "getServiceState" command fails, the 
NameNode is most likely down, so we should continue without raising an error. 
When the failover command fails, we should still fall back to killing ZKFC. 
Today, there is a bug in how "wait_for_standby" is set: we also need to check 
that the other NameNode is up; otherwise, skip the wait and proceed.
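
A rough sketch of the proposed flow under those rules (hypothetical helper 
names, plain Python subprocess calls instead of Ambari's shell/Execute 
resources):

{noformat}
# Rough sketch of the proposed best-attempt flow; helper names are
# hypothetical and plain subprocess stands in for Ambari's shell/Execute.
import subprocess
import time

def get_service_state(namenode_id):
    """Return 'active'/'standby', or None if the query fails (the NameNode is
    most likely down)."""
    try:
        out = subprocess.check_output(
            ["hdfs", "haadmin", "-getServiceState", namenode_id],
            universal_newlines=True)
        return out.strip()
    except subprocess.CalledProcessError:
        return None

def kill_zkfc():
    # Keep today's fallback: terminate ZKFC via its pid file.
    subprocess.call(
        "kill -15 `cat /var/run/hadoop/hdfs/hadoop-hdfs-zkfc.pid`", shell=True)

def initiate_safe_zkfc_failover(this_nn, other_nn):
    state = get_service_state(this_nn)
    if state is None:
        # getServiceState failed: this NameNode is likely down already, so
        # there is nothing to fail over; continue without an error.
        return
    if state == "standby":
        return  # already the standby, nothing to do

    # Best-attempt failover; fall back to killing ZKFC if the command fails.
    rc = subprocess.call(["hdfs", "haadmin", "-failover", this_nn, other_nn])
    if rc != 0:
        kill_zkfc()

    # Only wait for active -> standby if the other NameNode is up and can
    # therefore take over the active role (the wait_for_standby fix).
    if get_service_state(other_nn) is not None:
        for _ in range(50):
            if get_service_state(this_nn) == "standby":
                break
            time.sleep(6)
{noformat}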

> RU: Handle Namenode being down scenarios
> ----------------------------------------
>
>                 Key: AMBARI-13396
>                 URL: https://issues.apache.org/jira/browse/AMBARI-13396
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>    Affects Versions: 2.1.2
>            Reporter: Jayush Luniya
>            Assignee: Jayush Luniya
>             Fix For: 2.1.3
>
>         Attachments: AMBARI-13396.patch
>
>
> There are 2 scenarios that need to be handled during RU.
> *Setup:*
> * host1: namenode1, host2: namenode2
> * namenode1 on host1 is down
> *Scenario 1: During RU, namenode1 on host1 is going to be upgraded before 
> namenode2 on host2*
> Since namenode1 on host1 is already down, namenode2 is the active namenode. 
> So we should fix the logic to simply restart namenode1, as namenode2 will 
> remain active.
> *Scenario 2: During RU, namenode2 on host2 is going to be upgraded before 
> namenode1 on host1*
> Since namenode2 on host2 is the active, we should fail, because there isn't 
> another namenode instance that can become active. However, today we do the 
> following: 
> # Call "hdfs haadmin -failover nn2 nn1", which will fail since nn1 is not 
> healthy.
> # When this command fails, we kill ZKFC on this host and then wait for this 
> instance to come back as the standby, which will never happen because it 
> will come back as the active. 
> We should simply fail when the "haadmin failover" command fails, instead of 
> killing ZKFC.
> {noformat}
> 2015-10-12 22:35:15,307 - Rolling Upgrade - Initiating a ZKFC failover on 
> active NameNode host jay-ams-2.c.pramod-thangali.internal.
> 2015-10-12 22:35:15,308 - call['hdfs haadmin -failover nn2 nn1'] 
> {'logoutput': True, 'user': 'hdfs'}
> Operation failed: NameNode at 
> jay-ams-1.c.pramod-thangali.internal/10.240.0.178:8020 is not currently 
> healthy. Cannot be failover target
>       at 
> org.apache.hadoop.ha.ZKFailoverController.checkEligibleForFailover(ZKFailoverController.java:698)
>       at 
> org.apache.hadoop.ha.ZKFailoverController.doGracefulFailover(ZKFailoverController.java:632)
>       at 
> org.apache.hadoop.ha.ZKFailoverController.access$400(ZKFailoverController.java:61)
>       at 
> org.apache.hadoop.ha.ZKFailoverController$3.run(ZKFailoverController.java:604)
>       at 
> org.apache.hadoop.ha.ZKFailoverController$3.run(ZKFailoverController.java:601)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:422)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>       at 
> org.apache.hadoop.ha.ZKFailoverController.gracefulFailoverToYou(ZKFailoverController.java:601)
>       at 
> org.apache.hadoop.ha.ZKFCRpcServer.gracefulFailover(ZKFCRpcServer.java:94)
>       at 
> org.apache.hadoop.ha.protocolPB.ZKFCProtocolServerSideTranslatorPB.gracefulFailover(ZKFCProtocolServerSideTranslatorPB.java:61)
>       at 
> org.apache.hadoop.ha.proto.ZKFCProtocolProtos$ZKFCProtocolService$2.callBlockingMethod(ZKFCProtocolProtos.java:1548)
>       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:422)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
> 2015-10-12 22:35:17,748 - call returned (255, 'Operation failed: NameNode at 
> jay-ams-1.c.pramod-thangali.internal/10.240.0.178:8020 is not currently 
> healthy. Cannot be failover target\n\tat 
> org.apache.hadoop.ha.ZKFailoverController.checkEligibleForFailover(ZKFailoverController.java:698)\n\tat
>  
> org.apache.hadoop.ha.ZKFailoverController.doGracefulFailover(ZKFailoverController.java:632)\n\tat
>  
> org.apache.hadoop.ha.ZKFailoverController.access$400(ZKFailoverController.java:61)\n\tat
>  
> org.apache.hadoop.ha.ZKFailoverController$3.run(ZKFailoverController.java:604)\n\tat
>  
> org.apache.hadoop.ha.ZKFailoverController$3.run(ZKFailoverController.java:601)\n\tat
>  java.security.AccessController.doPrivileged(Native Method)\n\tat 
> javax.security.auth.Subject.doAs(Subject.java:422)\n\tat 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)\n\tat
>  
> org.apache.hadoop.ha.ZKFailoverController.gracefulFailoverToYou(ZKFailoverController.java:601)\n\tat
>  
> org.apache.hadoop.ha.ZKFCRpcServer.gracefulFailover(ZKFCRpcServer.java:94)\n\tat
>  
> org.apache.hadoop.ha.protocolPB.ZKFCProtocolServerSideTranslatorPB.gracefulFailover(ZKFCProtocolServerSideTranslatorPB.java:61)\n\tat
>  
> org.apache.hadoop.ha.proto.ZKFCProtocolProtos$ZKFCProtocolService$2.callBlockingMethod(ZKFCProtocolProtos.java:1548)\n\tat
>  
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)\n\tat
>  org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)\n\tat 
> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)\n\tat 
> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)\n\tat 
> java.security.AccessController.doPrivileged(Native Method)\n\tat 
> javax.security.auth.Subject.doAs(Subject.java:422)\n\tat 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)\n\tat
>  org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)')
> 2015-10-12 22:35:17,748 - Rolling Upgrade - failover command returned 255
> 2015-10-12 22:35:17,749 - call['ambari-sudo.sh su hdfs -l -s /bin/bash -c 'ls 
> /var/run/hadoop/hdfs/hadoop-hdfs-zkfc.pid > /dev/null 2>&1 && ps -p `cat 
> /var/run/hadoop/hdfs/hadoop-hdfs-zkfc.pid` > /dev/null 2>&1''] {}
> 2015-10-12 22:35:17,777 - call returned (0, '')
> 2015-10-12 22:35:17,778 - Execute['kill -15 `cat 
> /var/run/hadoop/hdfs/hadoop-hdfs-zkfc.pid`'] {'user': 'hdfs'}
> 2015-10-12 22:35:17,803 - File['/var/run/hadoop/hdfs/hadoop-hdfs-zkfc.pid'] 
> {'action': ['delete']}
> 2015-10-12 22:35:17,803 - Deleting 
> File['/var/run/hadoop/hdfs/hadoop-hdfs-zkfc.pid']
> 2015-10-12 22:35:17,803 - call['hdfs haadmin -getServiceState nn2 | grep 
> standby'] {'logoutput': True, 'user': 'hdfs'}
> 2015-10-12 22:35:20,922 - call returned (1, '')
> 2015-10-12 22:35:20,923 - Rolling Upgrade - check for standby returned 1
> 2015-10-12 22:35:20,923 - Waiting for this NameNode to become the standby one.
> 2015-10-12 22:35:20,923 - Execute['hdfs haadmin -getServiceState nn2 | grep 
> standby'] {'logoutput': True, 'tries': 50, 'user': 'hdfs', 'try_sleep': 6}
> 2015-10-12 22:35:23,135 - Retrying after 6 seconds. Reason: Execution of 
> 'hdfs haadmin -getServiceState nn2 | grep standby' returned 1. 
> 2015-10-12 22:35:31,388 - Retrying after 6 seconds. Reason: Execution of 
> 'hdfs haadmin -getServiceState nn2 | grep standby' returned 1. 
> 2015-10-12 22:35:39,709 - Retrying after 6 seconds. Reason: Execution of 
> 'hdfs haadmin -getServiceState nn2 | grep standby' returned 1. 
> 2015-10-12 22:35:47,992 - Retrying after 6 seconds. Reason: Execution of 
> 'hdfs haadmin -getServiceState nn2 | grep standby' returned 1. 
> 2015-10-12 22:35:56,289 - Retrying after 6 seconds. Reason: Execution of 
> 'hdfs haadmin -getServiceState nn2 | grep standby' returned 1. 
> 2015-10-12 22:36:04,627 - Retrying after 6 seconds. Reason: Execution of 
> 'hdfs haadmin -getServiceState nn2 | grep standby' returned 1. 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
