[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable
[ https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583083#comment-14583083 ]

Vinayakumar B commented on HADOOP-10251:

bq. Answer 2: when only one ZKFC and NN are started, the NN can stay in ACTIVE for a long time.

Yes, fine. That is because the patch from this issue was not merged properly into your code. In your source code, {{becomeActive()}} does not have the line {{serviceState = HAServiceState.ACTIVE;}}.

{code}
private synchronized void becomeActive() throws ServiceFailedException {
  LOG.info("Trying to make " + localTarget + " active...");
  try {
    HAServiceProtocolHelper.transitionToActive(localTarget.getProxy(
        conf, FailoverController.getRpcTimeoutToNewActive(conf)),
        createReqInfo());
    String msg = "Successfully transitioned " + localTarget +
        " to active state";
    LOG.info(msg);
    // Note: serviceState = HAServiceState.ACTIVE; is missing at this point.
    recordActiveAttempt(new ActiveAttemptRecord(true, msg));
  } catch (Throwable t) {
    String msg = "Couldn't make " + localTarget + " active";
    LOG.fatal(msg, t);
    recordActiveAttempt(new ActiveAttemptRecord(false, msg + "\n" +
        StringUtils.stringifyException(t)));
    if (t instanceof ServiceFailedException) {
      throw (ServiceFailedException)t;
    } else {
      throw new ServiceFailedException("Couldn't transition to active", t);
    }
    /*
     * TODO:
     * we need to make sure that if we get fenced and then quickly restarted,
     * none of these calls will retry across the restart boundary
     * perhaps the solution is that, whenever the nn starts, it gets a unique
     * ID, and when we start becoming active, we record it, and then any future
     * calls use the same ID
     */
  }
}
{code}

So if the previous state of the NameNode is not STANDBY, it will stay ACTIVE for a long time. But if it transitioned from STANDBY, it will switch continuously.

Add {{serviceState = HAServiceState.ACTIVE;}} in {{becomeActive()}} after {{LOG.info(msg);}} and everything will be fine.
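The difference between the two cases in the comment above can be illustrated with a small model. This is only a sketch under stated assumptions, not the Hadoop implementation: it assumes the ZKFC compares the state it last recorded against the state the NameNode reports, skips the comparison while the recorded state is still INITIALIZING, and leaves the election on a mismatch. The class {{ZkfcModel}}, the method {{onReportedState}}, and the {{electionsQuit}} counter are hypothetical names introduced for illustration.

```java
// Simplified, hypothetical model of the state bookkeeping discussed above.
// It is NOT the Hadoop ZKFailoverController; all names are illustrative.
final class ZkfcModel {
    enum State { INITIALIZING, ACTIVE, STANDBY }

    // What the controller believes its NameNode's state to be.
    private State recorded = State.INITIALIZING;
    // Whether becomeActive() contains the "serviceState = ACTIVE" assignment.
    private final boolean recordsActive;
    // How many times the controller left the election after a mismatch.
    int electionsQuit = 0;

    ZkfcModel(boolean recordsActive) {
        this.recordsActive = recordsActive;
    }

    void becomeStandby() {
        recorded = State.STANDBY;
    }

    void becomeActive() {
        if (recordsActive) {
            recorded = State.ACTIVE;
        }
        // Without the assignment, `recorded` keeps whatever it held before.
    }

    // Assumed check: compare the recorded state with the reported one.
    void onReportedState(State reported) {
        if (recorded == State.INITIALIZING) {
            return; // nothing recorded yet, so nothing to compare
        }
        if (recorded != reported) {
            electionsQuit++; // mismatch: leave the election and rejoin later
        }
    }

    public static void main(String[] args) {
        // Unpatched code, NameNode made active straight after start-up.
        ZkfcModel fresh = new ZkfcModel(false);
        fresh.becomeActive();
        fresh.onReportedState(State.ACTIVE);
        System.out.println("unpatched, from INITIALIZING: " + fresh.electionsQuit);

        // Unpatched code, NameNode was STANDBY before becoming active.
        ZkfcModel fromStandby = new ZkfcModel(false);
        fromStandby.becomeStandby();
        fromStandby.becomeActive();
        fromStandby.onReportedState(State.ACTIVE);
        System.out.println("unpatched, from STANDBY: " + fromStandby.electionsQuit);

        // Patched code, same STANDBY to ACTIVE transition.
        ZkfcModel patched = new ZkfcModel(true);
        patched.becomeStandby();
        patched.becomeActive();
        patched.onReportedState(State.ACTIVE);
        System.out.println("patched, from STANDBY: " + patched.electionsQuit);
    }
}
```

In this model only the unpatched controller that was previously STANDBY records a mismatch, which corresponds to the continuous switching described in the comment.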
Both NameNodes could be in STANDBY State if SNN network is unstable
-------------------------------------------------------------------

Key: HADOOP-10251
URL: https://issues.apache.org/jira/browse/HADOOP-10251
Project: Hadoop Common
Issue Type: Bug
Components: ha
Affects Versions: 2.2.0
Reporter: Vinayakumar B
Assignee: Vinayakumar B
Priority: Critical
Fix For: 2.5.0
Attachments: HADOOP-10251.patch, HADOOP-10251.patch, HADOOP-10251.patch, HADOOP-10251.patch, HADOOP-10251.patch

The following corner-case scenario happened in one of our clusters.
1. NN1 was Active and NN2 was Standby.
2. The network of the NN2 machine was slow.
3. NN1 got shut down.
4. NN2's ZKFC got the notification and tried to check for the old active for fencing. (This took a little more time, again due to the slow network.)
5. In between, NN1 got restarted by our automatic monitoring, and its ZKFC made it Active.
6. Now NN2's ZKFC got NN1 as the old active and gracefully fenced NN1 to STANDBY.
7. Before writing the ActiveBreadCrumb to ZK, NN2's ZKFC got a session timeout and got shut down before making NN2 Active.

*Now the cluster has both NameNodes in STANDBY.*
NN1's ZKFC still thinks that its NameNode is in the Active state.
NN2's ZKFC is waiting for election.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
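The seven steps above can be replayed as a tiny state trace. This is a hedged sketch, not Hadoop code: the class {{BothStandbyScenario}} and its fields are hypothetical, and it only tracks the actual state of each NameNode plus what NN1's ZKFC believes, which is enough to show where the belief and the real state diverge.

```java
// Hypothetical replay of the seven steps above; not Hadoop code.
final class BothStandbyScenario {
    enum State { ACTIVE, STANDBY, DOWN }

    // Steps 1-2: NN1 is Active, NN2 is Standby (on a slow network).
    State nn1 = State.ACTIVE;
    State nn2 = State.STANDBY;
    // What NN1's ZKFC believes about its own NameNode.
    State zkfc1Belief = State.ACTIVE;

    void run() {
        // Step 3: NN1 gets shut down.
        nn1 = State.DOWN;
        // Step 4: NN2's ZKFC is notified but is slow to look up the old active.
        // Step 5: NN1 is restarted and its ZKFC makes it Active again.
        nn1 = State.ACTIVE;
        zkfc1Belief = State.ACTIVE;
        // Step 6: NN2's ZKFC sees NN1 as the old active and gracefully fences
        // it to STANDBY. NN1's ZKFC is not informed, so its belief is unchanged.
        nn1 = State.STANDBY;
        // Step 7: NN2's ZKFC loses its ZK session before making NN2 Active,
        // so nn2 is left as STANDBY.
    }

    boolean bothStandby() {
        return nn1 == State.STANDBY && nn2 == State.STANDBY;
    }

    boolean zkfc1IsStale() {
        return zkfc1Belief != nn1;
    }

    public static void main(String[] args) {
        BothStandbyScenario s = new BothStandbyScenario();
        s.run();
        System.out.println("both standby: " + s.bothStandby());
        System.out.println("NN1 ZKFC belief is stale: " + s.zkfc1IsStale());
    }
}
```

The stale belief at the end is the condition that a comparison between the recorded and the reported service state is meant to catch.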
[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable
[ https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583029#comment-14583029 ]

lvchuanwen commented on HADOOP-10251:

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.hadoop.ha;

import java.io.IOException;
import java.net.InetSocketAddress;
import java.security.PrivilegedAction;
import java.security.PrivilegedExceptionAction;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.HadoopIllegalArgumentException;
import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ha.ActiveStandbyElector.ActiveNotFoundException;
import org.apache.hadoop.ha.ActiveStandbyElector.ActiveStandbyElectorCallback;
import org.apache.hadoop.ha.HAServiceProtocol.HAServiceState;
import org.apache.hadoop.ha.HAServiceProtocol.StateChangeRequestInfo;
import org.apache.hadoop.ha.HAServiceProtocol.RequestSource;
import org.apache.hadoop.util.ZKUtil;
import org.apache.hadoop.util.ZKUtil.ZKAuthInfo;
import org.apache.hadoop.ha.HealthMonitor.State;
import org.apache.hadoop.ipc.Server;
import org.apache.hadoop.security.AccessControlException;
import org.apache.hadoop.security.SecurityUtil;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.authorize.PolicyProvider;
import org.apache.hadoop.util.StringUtils;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.hadoop.util.ToolRunner;
import org.apache.zookeeper.data.ACL;

import com.google.common.annotations.VisibleForTesting;
import com.google.common.base.Preconditions;
import com.google.common.base.Throwables;
import com.google.common.util.concurrent.ThreadFactoryBuilder;

@InterfaceAudience.LimitedPrivate("HDFS")
public abstract class ZKFailoverController {

  static final Log LOG = LogFactory.getLog(ZKFailoverController.class);

  public static final String ZK_QUORUM_KEY = "ha.zookeeper.quorum";
  private static final String ZK_SESSION_TIMEOUT_KEY = "ha.zookeeper.session-timeout.ms";
  private static final int ZK_SESSION_TIMEOUT_DEFAULT = 5*1000;
  private static final String ZK_PARENT_ZNODE_KEY = "ha.zookeeper.parent-znode";
  public static final String ZK_ACL_KEY = "ha.zookeeper.acl";
  private static final String ZK_ACL_DEFAULT = "world:anyone:rwcda";
  public static final String ZK_AUTH_KEY = "ha.zookeeper.auth";
  static final String ZK_PARENT_ZNODE_DEFAULT = "/hadoop-ha";

  /**
   * All of the conf keys used by the ZKFC. This is used in order to allow
   * them to be overridden on a per-nameservice or per-namenode basis.
   */
  protected static final String[] ZKFC_CONF_KEYS = new String[] {
    ZK_QUORUM_KEY,
    ZK_SESSION_TIMEOUT_KEY,
    ZK_PARENT_ZNODE_KEY,
    ZK_ACL_KEY,
    ZK_AUTH_KEY
  };

  protected static final String USAGE =
      "Usage: java zkfc [ -formatZK [-force] [-nonInteractive] ]";

  /** Unable to format the parent znode in ZK */
  static final int ERR_CODE_FORMAT_DENIED = 2;
  /** The parent znode doesn't exist in ZK */
  static final int ERR_CODE_NO_PARENT_ZNODE = 3;
  /** Fencing is not properly configured */
  static final int ERR_CODE_NO_FENCER = 4;
  /** Automatic failover is not enabled */
  static final int ERR_CODE_AUTO_FAILOVER_NOT_ENABLED = 5;
  /** Cannot connect to ZooKeeper */
  static final int ERR_CODE_NO_ZK = 6;

  protected Configuration conf;
  private String zkQuorum;
  protected final HAServiceTarget localTarget;

  private HealthMonitor healthMonitor;
  private ActiveStandbyElector elector;
  protected ZKFCRpcServer rpcServer;

  private State lastHealthState = State.INITIALIZING;

  private volatile HAServiceState serviceState = HAServiceState.INITIALIZING;

  /** Set if a fatal error occurs */
  private String fatalError = null;

  /**
   * A future nanotime before which the ZKFC will not join the election.
   * This is used during graceful failover.
   */
[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable
[ https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583024#comment-14583024 ]

lvchuanwen commented on HADOOP-10251:

Answer 2: when only one ZKFC and NN are started, the NN can stay in ACTIVE for a long time.

hdfs-nn1-zkfc-host195.log

2015-06-12 07:18:11,471 INFO org.apache.hadoop.hdfs.tools.DFSZKFailoverController: Failover controller configured for NameNode NameNode at zdh195/10.43.156.195:9000
2015-06-12 07:18:11,608 INFO org.apache.zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.5-cdh5.3.2--1, built on 05/15/2015 03:44 GMT
2015-06-12 07:18:11,608 INFO org.apache.zookeeper.ZooKeeper: Client environment:host.name=zdh195
2015-06-12 07:18:11,608 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.version=1.7.0_55
2015-06-12 07:18:11,608 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation
2015-06-12 07:18:11,608 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.home=/usr/java/jdk/jre
2015-06-12 07:18:11,608 INFO org.apache.zookeeper.ZooKeeper: Client
[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable
[ https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583120#comment-14583120 ]

lvchuanwen commented on HADOOP-10251:

Thank you very much. I was careless.
[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable
[ https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581700#comment-14581700 ]

Vinayakumar B commented on HADOOP-10251:

There is no difference in the code between normal auto failover and failover using haadmin. Is normal auto failover working for you?
[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable
[ https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581725#comment-14581725 ]

lvchuanwen commented on HADOOP-10251:

Auto failover is normal. When I kill the process of the active NameNode nn1, nn2 can transition to active.

I found that if the key ha.health-monitor.check-interval.ms is set to two different values, hdfs haadmin has no problem.

nn1 setting:
<property>
  <name>ha.health-monitor.check-interval.ms</name>
  <value>2000</value>
</property>

nn2 setting:
<property>
  <name>ha.health-monitor.check-interval.ms</name>
  <value>1000</value>
</property>
[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable
[ https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581735#comment-14581735 ]

Vinayakumar B commented on HADOOP-10251:

bq. When I kill the process of the active NameNode nn1, nn2 can transition to active.

When you kill the active nn2, does nn1 transition to active?
[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable
[ https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581766#comment-14581766 ]

lvchuanwen commented on HADOOP-10251:

Yes. After running hadoop-daemon.sh start namenode on nn1 and then killing nn2, nn1 can transition to active.
[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable
[ https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581795#comment-14581795 ]

Vinayakumar B commented on HADOOP-10251:

I do not understand what the problem is in your cluster, especially since it occurs only with haadmin failover while auto failover works fine. Can you share the ZKFC logs from an auto failover?
[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable
[ https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14581767#comment-14581767 ] lvchuanwen commented on HADOOP-10251: - yes .hadoop-deamon.sh start namenode nn1 then kill nn2 ,nn1 can transition to active Both NameNodes could be in STANDBY State if SNN network is unstable --- Key: HADOOP-10251 URL: https://issues.apache.org/jira/browse/HADOOP-10251 Project: Hadoop Common Issue Type: Bug Components: ha Affects Versions: 2.2.0 Reporter: Vinayakumar B Assignee: Vinayakumar B Priority: Critical Fix For: 2.5.0 Attachments: HADOOP-10251.patch, HADOOP-10251.patch, HADOOP-10251.patch, HADOOP-10251.patch, HADOOP-10251.patch Following corner scenario happened in one of our cluster. 1. NN1 was Active and NN2 was Standby 2. NN2 machine's network was slow 3. NN1 got shutdown. 4. NN2 ZKFC got the notification and trying to check for old active for fencing. (This took little more time, again due to slow network) 5. In between, NN1 got restarted by our automatic monitoring, and ZKFC made it Active. 6. Now NN2 ZKFC got Old Active as NN1 and it did graceful fencing of NN1 to STANBY. 7. Before writing ActiveBreadCrumb to ZK, NN2 ZKFC got session timeout and got shutdown before making NN2 Active. *Now cluster having both NameNodes as STANDBY.* NN1 ZKFC still thinks that its nameNode is in Active state. NN2 ZKFC waiting for election. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable
[ https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582914#comment-14582914 ]

Vinayakumar B commented on HADOOP-10251:

bq. 2015-06-12 02:26:02,608 ERROR org.apache.hadoop.ha.ZKFailoverController: Local service NameNode at zdh196/10.43.156.196:9000 has changed the serviceState to active. Expected was standby. Quitting election marking fencing necessary.
bq. 2015-06-12 02:27:56,878 ERROR org.apache.hadoop.ha.ZKFailoverController: Local service NameNode at zdh195/10.43.156.195:9000 has changed the serviceState to active. Expected was standby. Quitting election marking fencing necessary.

The use case mentioned is normal automatic failover, not manual failover using haadmin commands. I also see that neither NN1 nor NN2 stays in Active mode when the transition is from standby to active. This is strange. Can you check the following?

1. Stop both ZKFCs and both NNs.
2. Start only one ZKFC and NN. It should successfully become active; check whether it stays in ACTIVE for a long time.
3. Attach the ZKFC logs after the restart (all lines from the restart point onward).
4. Also attach the ZKFailoverController.java source code you are using.
[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable
[ https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581590#comment-14581590 ]

Vinayakumar B commented on HADOOP-10251:

Which version of Hadoop are you using? I ask because I can see the logs below (DEBUG lines excluded):

{noformat}
2015-06-10 02:57:56,073 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at zdh195/10.43.156.195:9000 to active state
2015-06-10 02:57:56,092 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully became active. Successfully transitioned NameNode at zdh195/10.43.156.195:9000 to active state
2015-06-10 02:57:57,082 ERROR org.apache.hadoop.ha.ZKFailoverController: Local service NameNode at zdh195/10.43.156.195:9000 has changed the serviceState to active. Expected was standby. Quitting election marking fencing necessary.
{noformat}

Immediately after {{becomeActive()}}, the ERROR log shows that the expected state is {{standby}}. {{serviceState}} is changed to {{active}} in {{becomeActive()}} immediately after the log line above. IMO, this is possible only if {{volatile}} is missing from the declaration of {{serviceState}}:

{code}private volatile HAServiceState serviceState = HAServiceState.INITIALIZING;{code}

Do you have this in your code?
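The check behind the ERROR line above can be illustrated with a toy model. This is a sketch only, not the Hadoop source: the controller remembers the state it last requested and flags a mismatch when the service reports something else. If the write in the become-active path is missing (or not visible to the monitoring thread), the remembered value stays standby and the check fires right after a successful transition. All names here (ToyState, ToyController, recordsActive, contradicts) are hypothetical.

```java
/** Toy stand-ins; none of these names come from Hadoop. */
enum ToyState { INITIALIZING, STANDBY, ACTIVE }

class ToyController {
    // volatile so that a write made by the election thread is visible
    // to the thread that later compares it with the reported state.
    private volatile ToyState serviceState = ToyState.INITIALIZING;

    // Assumption for illustration: lets us switch the bookkeeping write on and off.
    private final boolean recordsActive;

    ToyController(boolean recordsActive) {
        this.recordsActive = recordsActive;
    }

    void becomeStandby() {
        serviceState = ToyState.STANDBY;
    }

    void becomeActive() {
        // The call asking the service to go active would happen here.
        if (recordsActive) {
            serviceState = ToyState.ACTIVE;
        }
    }

    ToyState expected() {
        return serviceState;
    }

    /** True when the state reported by the service differs from the remembered one. */
    boolean contradicts(ToyState reported) {
        return reported != serviceState;
    }
}

class ServiceStateTrackingSketch {
    public static void main(String[] args) {
        for (boolean records : new boolean[] {true, false}) {
            ToyController c = new ToyController(records);
            c.becomeStandby();
            c.becomeActive();
            // The service itself now reports ACTIVE in both runs.
            System.out.println("records=" + records
                    + " expected=" + c.expected()
                    + " mismatch=" + c.contradicts(ToyState.ACTIVE));
        }
    }
}
```

With the bookkeeping write in place the remembered state is ACTIVE and no mismatch is reported; without it the remembered state is still STANDBY, which corresponds to the "Expected was standby" wording in the log.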
[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable
[ https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581560#comment-14581560 ]

lvchuanwen commented on HADOOP-10251:

You can try the command hdfs haadmin -failover nn1 nn2 and then check whether the active node nn1 behaves normally. nn1 keeps changing state: active - standby - active - standby ...

Sorry for my poor English; I hope you can understand. Thanks.
[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable
[ https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582800#comment-14582800 ]

lvchuanwen commented on HADOOP-10251:

Use cases:

1. NN1 was Active and NN2 was Standby. Kill NN1; NN2 transitions to active.
2. hadoop-daemon.sh start namenode NN2. Now NN1 is Standby and NN2 is Active.
3. Kill NN2; NN1 transitions to active.

Attaching the hdfs-nn1-zkfc-host195.log and hdfs-nn2-zkfc-host196.log files.
[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable
[ https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582751#comment-14582751 ]

Xinglong.Li commented on HADOOP-10251:

Bump!
[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable
[ https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582802#comment-14582802 ]

lvchuanwen commented on HADOOP-10251:

hdfs-nn1-zkfc-host195.log

2015-06-12 02:25:38,799 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at zdh195/10.43.156.195:9000: Failed on local exception: java.io.EOFException; Host Details : local host is: zdh195/10.43.156.195; destination host is: zdh195:9000;
2015-06-12 02:25:38,799 INFO org.apache.hadoop.ha.HealthMonitor: Entering state SERVICE_NOT_RESPONDING
2015-06-12 02:25:38,800 INFO org.apache.hadoop.ha.ZKFailoverController: Local service NameNode at zdh195/10.43.156.195:9000 entered state: SERVICE_NOT_RESPONDING
2015-06-12 02:25:38,800 INFO org.apache.hadoop.ha.ZKFailoverController: Quitting master election for NameNode at zdh195/10.43.156.195:9000 and marking that fencing is necessary
2015-06-12 02:25:38,800 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election
2015-06-12 02:25:38,803 INFO org.apache.zookeeper.ZooKeeper: Session: 0x24d91acb5cb1c33 closed
2015-06-12 02:25:38,803 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x24d91acb5cb1c33
2015-06-12 02:25:38,803 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2015-06-12 02:25:40,805 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: zdh195/10.43.156.195:9000. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
2015-06-12 02:25:40,807 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at zdh195/10.43.156.195:9000: Call From zdh195/10.43.156.195 to zdh195:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
...
2015-06-12 02:27:22,976 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at zdh195/10.43.156.195:9000: Call From zdh195/10.43.156.195 to zdh195:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
2015-06-12 02:27:24,978 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: zdh195/10.43.156.195:9000. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
2015-06-12 02:27:24,979 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at zdh195/10.43.156.195:9000: Call From zdh195/10.43.156.195 to zdh195:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
2015-06-12 02:27:26,981 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: zdh195/10.43.156.195:9000. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
2015-06-12 02:27:27,810 INFO org.apache.hadoop.ha.HealthMonitor: Entering state SERVICE_HEALTHY
2015-06-12 02:27:27,810 INFO org.apache.hadoop.ha.ZKFailoverController: Local service NameNode at zdh195/10.43.156.195:9000 entered state: SERVICE_HEALTHY
2015-06-12 02:27:27,811 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=zdh196:2181,zdh195:2181,zdh197:2181 sessionTimeout=1 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@651a6959
2015-06-12 02:27:27,812 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server zdh195/10.43.156.195:2181. Will not attempt to authenticate using SASL (unknown error)
2015-06-12 02:27:27,812 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to zdh195/10.43.156.195:2181, initiating session
2015-06-12 02:27:27,814 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server zdh195/10.43.156.195:2181, sessionid = 0x24d91acb5cb1c51, negotiated timeout = 1
2015-06-12 02:27:27,815 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
2015-06-12 02:27:27,816 INFO org.apache.hadoop.ha.ZKFailoverController: ZK Election indicated that NameNode at zdh195/10.43.156.195:9000 should become standby
2015-06-12 02:27:27,824 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at zdh195/10.43.156.195:9000 to standby state
2015-06-12 02:27:48,662 INFO org.apache.hadoop.ha.ActiveStandbyElector: Checking for any old active which needs to be fenced...
2015-06-12 02:27:48,663 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old node exists: 0a0a636c757374657231393512036e6e321a067a646831393620a84628d33e
2015-06-12 02:27:48,665 INFO
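The "Old node exists" value in the log above is a hex dump of the data stored in the ZooKeeper node, and it can be read by hand. The sketch below is an illustration only, with no Hadoop or protobuf dependency: it assumes the bytes use the protobuf wire format with only length-delimited and varint fields, and that every length fits in one byte. The class and method names are hypothetical. The field meanings (nameservice, NameNode ID, host, NameNode RPC port, ZKFC port) are inferred from values that appear elsewhere in these logs, such as cluster195, zdh196, 9000 and 8019, so treat them as an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: decodes the hex string logged after "Old node exists:". */
class BreadcrumbDecoder {

    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    /** Returns "fieldNumber=value" entries in the order they appear. */
    static List<String> decode(String hex) {
        byte[] buf = fromHex(hex);
        List<String> fields = new ArrayList<>();
        int pos = 0;
        while (pos < buf.length) {
            int tag = buf[pos++] & 0xff;
            int fieldNumber = tag >>> 3;   // upper bits: field number
            int wireType = tag & 0x07;     // lower three bits: wire type
            if (wireType == 2) {
                // Length-delimited field, read as a UTF-8 string.
                int len = buf[pos++] & 0xff; // assumption: length < 128
                fields.add(fieldNumber + "=" + new String(buf, pos, len, StandardCharsets.UTF_8));
                pos += len;
            } else if (wireType == 0) {
                // Varint: 7 payload bits per byte, least significant group first.
                long value = 0;
                int shift = 0;
                int b;
                do {
                    b = buf[pos++] & 0xff;
                    value |= (long) (b & 0x7f) << shift;
                    shift += 7;
                } while ((b & 0x80) != 0);
                fields.add(fieldNumber + "=" + value);
            } else {
                throw new IllegalArgumentException("unsupported wire type " + wireType);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // prints [1=cluster195, 2=nn2, 3=zdh196, 4=9000, 5=8019]
        System.out.println(decode("0a0a636c757374657231393512036e6e321a067a646831393620a84628d33e"));
    }
}
```

Read this way, the old node recorded in the nn1 ZKFC log points at nn2 on zdh196, which is consistent with nn2 having been the previous active.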
[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable
[ https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582806#comment-14582806 ]

lvchuanwen commented on HADOOP-10251:

hdfs-nn2-zkfc-host196.log

2015-06-12 02:25:54,146 INFO org.apache.hadoop.ha.ActiveStandbyElector: Checking for any old active which needs to be fenced...
2015-06-12 02:25:54,147 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old node exists: 0a0a636c757374657231393512036e6e311a067a646831393520a84628d33e
2015-06-12 02:25:54,149 INFO org.apache.hadoop.ha.ZKFailoverController: Should fence: NameNode at zdh195/10.43.156.195:9000
2015-06-12 02:25:55,151 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: zdh195/10.43.156.195:9000. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
2015-06-12 02:25:55,152 WARN org.apache.hadoop.ha.FailoverController: Unable to gracefully make NameNode at zdh195/10.43.156.195:9000 standby (unable to connect)
java.net.ConnectException: Call From zdh196/10.43.156.196 to zdh195:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
    at sun.reflect.GeneratedConstructorAccessor24.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
    at org.apache.hadoop.ipc.Client.call(Client.java:1415)
    at org.apache.hadoop.ipc.Client.call(Client.java:1364)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
    at com.sun.proxy.$Proxy12.transitionToStandby(Unknown Source)
    at org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.transitionToStandby(HAServiceProtocolClientSideTranslatorPB.java:112)
    at org.apache.hadoop.ha.FailoverController.tryGracefulFence(FailoverController.java:172)
    at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:516)
    at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:507)
    at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
    at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:894)
    at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:901)
    at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:800)
    at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:605)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:499)
Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:606)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:700)
    at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1463)
    at org.apache.hadoop.ipc.Client.call(Client.java:1382)
    ... 14 more
2015-06-12 02:25:55,153 INFO org.apache.hadoop.ha.NodeFencer: == Beginning Service Fencing Process... ==
2015-06-12 02:25:55,154 INFO org.apache.hadoop.ha.NodeFencer: Trying method 1/2: org.apache.hadoop.ha.SshFenceByTcpPort(null)
2015-06-12 02:25:55,157 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Connecting to zdh195...
2015-06-12 02:25:55,157 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Connecting to zdh195 port 22
2015-06-12 02:25:55,159 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Connection established
2015-06-12 02:25:55,188 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Remote version string: SSH-2.0-OpenSSH_5.3
2015-06-12 02:25:55,188 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Local version string: SSH-2.0-JSCH-0.1.42
2015-06-12 02:25:55,189 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: CheckCiphers:
[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable
[ https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581665#comment-14581665 ]

lvchuanwen commented on HADOOP-10251:

Version 2.5.0. volatile is present.
[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable
[ https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578728#comment-14578728 ]

Vinayakumar B commented on HADOOP-10251:

bq. After removing the code // healthMonitor.addServiceStateCallback(new ServiceStateCallBacks()); failover command recovery normal

I did not understand the problem. Can you elaborate, please?
[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable
[ https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579884#comment-14579884 ]

lvchuanwen commented on HADOOP-10251:

Attaching some logs:

2015-06-10 02:57:53,727 DEBUG org.apache.hadoop.ipc.Server: Successfully authorized userInfo { effectiveUser: hdfs } protocol: org.apache.hadoop.ha.ZKFCProtocol
2015-06-10 02:57:53,728 DEBUG org.apache.hadoop.ipc.Server: got #0
2015-06-10 02:57:53,728 DEBUG org.apache.hadoop.ipc.Server: IPC Server handler 2 on 8019: org.apache.hadoop.ha.ZKFCProtocol.gracefulFailover from 10.43.156.196:49132 Call#0 Retry#0 for RpcKind RPC_PROTOCOL_BUFFER
2015-06-10 02:57:53,730 DEBUG org.apache.hadoop.security.UserGroupInformation: PrivilegedAction as:hdfs (auth:SIMPLE) from:org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
2015-06-10 02:57:53,755 DEBUG org.apache.hadoop.security.Groups: Returning fetched groups for 'hdfs'
2015-06-10 02:57:53,755 INFO org.apache.hadoop.hdfs.tools.DFSZKFailoverController: Allowed RPC access from hdfs (auth:SIMPLE) at 10.43.156.196
2015-06-10 02:57:53,756 DEBUG org.apache.hadoop.security.UserGroupInformation: PrivilegedAction as:hdfs (auth:SIMPLE) from:org.apache.hadoop.ha.ZKFailoverController.gracefulFailoverToYou(ZKFailoverController.java:603)
2015-06-10 02:57:53,760 DEBUG org.apache.zookeeper.ClientCnxn: Reading reply sessionid:0x34dcf74b50a05d7, packet:: clientPath:null serverPath:null finished:false header:: 4,4 replyHeader:: 4,38654797504,0 request:: '/hadoop-ha/cluster195/ActiveStandbyElectorLock,F response:: #aa636c75737465723139351236e6e321a67a646831393620ffa84628ffd33e,s{38654797500,38654797500,1433876199859,1433876199859,0,0,0,93891338261633392,31,0,38654797500}
2015-06-10 02:57:53,768 DEBUG org.apache.hadoop.hdfs.server.namenode.NameNode: Setting fs.defaultFS to hdfs://zdh196:9000
2015-06-10 02:57:53,770 INFO org.apache.hadoop.ha.ZKFailoverController: Asking NameNode at zdh196/10.43.156.196:9000 to cede its active state for 1ms
2015-06-10 02:57:53,772 DEBUG org.apache.hadoop.ipc.Client: getting client out of cache: org.apache.hadoop.ipc.Client@7007cf85
2015-06-10 02:57:53,779 DEBUG org.apache.hadoop.ipc.Client: The ping interval is 6 ms.
2015-06-10 02:57:53,779 DEBUG org.apache.hadoop.ipc.Client: Connecting to zdh196/10.43.156.196:8019
2015-06-10 02:57:53,781 DEBUG org.apache.hadoop.ipc.Client: IPC Client (256152889) connection to zdh196/10.43.156.196:8019 from hdfs: starting, having connections 2
2015-06-10 02:57:53,781 DEBUG org.apache.hadoop.ipc.Client: IPC Client (256152889) connection to zdh196/10.43.156.196:8019 from hdfs sending #147
2015-06-10 02:57:53,969 DEBUG org.apache.zookeeper.ClientCnxn: Got notification sessionid:0x34dcf74b50a05d7
2015-06-10 02:57:53,970 DEBUG org.apache.zookeeper.ClientCnxn: Got WatchedEvent state:SyncConnected type:NodeDeleted path:/hadoop-ha/cluster195/ActiveStandbyElectorLock for sessionid 0x34dcf74b50a05d7
2015-06-10 02:57:53,974 DEBUG org.apache.hadoop.ha.ActiveStandbyElector: Watcher event type: NodeDeleted with state:SyncConnected for path:/hadoop-ha/cluster195/ActiveStandbyElectorLock connectionState: CONNECTED for elector id=1893177739 appData=0a0a636c757374657231393512036e6e311a067a646831393520a84628d33e cb=Elector callbacks for NameNode at zdh195/10.43.156.195:9000
2015-06-10 02:57:53,979 DEBUG org.apache.hadoop.ipc.Client: IPC Client (256152889) connection to zdh196/10.43.156.196:8019 from hdfs got value #147
2015-06-10 02:57:53,979 DEBUG org.apache.zookeeper.ClientCnxn: Reading reply sessionid:0x34dcf74b50a05d7, packet:: clientPath:/hadoop-ha/cluster195/ActiveStandbyElectorLock serverPath:/hadoop-ha/cluster195/ActiveStandbyElectorLock finished:false header:: 5,1 replyHeader:: 5,38654797507,0 request:: '/hadoop-ha/cluster195/ActiveStandbyElectorLock,#aa636c75737465723139351236e6e311a67a646831393520ffa84628ffd33e,v{s{31,s{'world,'anyone}}},1 response:: '/hadoop-ha/cluster195/ActiveStandbyElectorLock
2015-06-10 02:57:53,979
DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine: Call: cedeActive took 200ms 2015-06-10 02:57:53,982 DEBUG org.apache.hadoop.ha.ActiveStandbyElector: CreateNode result: 0 for path: /hadoop-ha/cluster195/ActiveStandbyElectorLock connectionState: CONNECTED for elector id=1893177739 appData=0a0a636c757374657231393512036e6e311a067a646831393520a84628d33e cb=Elector callbacks for NameNode at zdh195/10.43.156.195:9000 2015-06-10 02:57:53,982 INFO org.apache.hadoop.ha.ActiveStandbyElector: Checking for any old active which needs to be fenced... 2015-06-10 02:57:53,985 DEBUG org.apache.zookeeper.ClientCnxn: Reading reply sessionid:0x34dcf74b50a05d7, packet:: clientPath:null serverPath:null finished:false header:: 6,4 replyHeader:: 6,38654797507,-101 request:: '/hadoop-ha/cluster195/ActiveBreadCrumb,F response:: 2015-06-10 02:57:53,996 INFO
[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable
[ https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578184#comment-14578184 ]

lvchuanwen commented on HADOOP-10251:

Hi, nn1 and nn2 alternately transition to the active state whenever I run {{hdfs haadmin -failover nn1 nn2}}. After removing (commenting out) the line {{healthMonitor.addServiceStateCallback(new ServiceStateCallBacks());}} the failover command works normally again. Thank you.
[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable
[ https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13981004#comment-13981004 ]

Hudson commented on HADOOP-10251:

FAILURE: Integrated in Hadoop-Hdfs-trunk #1742 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1742/])
HADOOP-10251. Both NameNodes could be in STANDBY State if SNN network is unstable. Contributed by Vinayakumar B. (umamahesh: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1589494)
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/HealthMonitor.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ZKFailoverController.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ha/TestZKFailoverController.java
[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable
[ https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13979592#comment-13979592 ]

Hudson commented on HADOOP-10251:

ABORTED: Integrated in Hadoop-Yarn-trunk #550 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/550/])
HADOOP-10251. Both NameNodes could be in STANDBY State if SNN network is unstable. Contributed by Vinayakumar B. (umamahesh: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1589494)
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/HealthMonitor.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ZKFailoverController.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ha/TestZKFailoverController.java
[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable
[ https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13979710#comment-13979710 ]

Hudson commented on HADOOP-10251:

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1767 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1767/])
HADOOP-10251. Both NameNodes could be in STANDBY State if SNN network is unstable. Contributed by Vinayakumar B. (umamahesh: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1589494)
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/HealthMonitor.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ZKFailoverController.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ha/TestZKFailoverController.java
[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable
[ https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13978796#comment-13978796 ]

Hudson commented on HADOOP-10251:

SUCCESS: Integrated in Hadoop-trunk-Commit #5554 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5554/])
HADOOP-10251. Both NameNodes could be in STANDBY State if SNN network is unstable. Contributed by Vinayakumar B. (umamahesh: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1589494)
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/HealthMonitor.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ZKFailoverController.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ha/TestZKFailoverController.java
[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable
[ https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977851#comment-13977851 ]

Uma Maheswara Rao G commented on HADOOP-10251:

+1 on the latest patch. I will commit it shortly.
[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable
[ https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13973893#comment-13973893 ]

Hadoop QA commented on HADOOP-10251:

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12640785/HADOOP-10251.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/3809//testReport/
Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/3809//console

This message is automatically generated.
[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable
[ https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13973090#comment-13973090 ]

Uma Maheswara Rao G commented on HADOOP-10251:

I think your idea will work, and the patch almost looks good to me. One question:
{code}
/**
+ * Callback interface for service state change events.
+ *
+ * This interface is called whenever there is a change in the service state.
+ */
{code}
It seems this interface will be called on every health monitor status check, but the doc says it is a service state change event. The doc is exposing implementation details, since the state comparison and the necessary actions on a state change are done in the implementation. So at the interface level we cannot be sure whether the service state has changed from the last state, right? Can you update this to something like "Service state notification"?
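The distinction raised in this review can be illustrated with a minimal sketch. It assumes a listener that is notified on every health check and detects changes itself by comparing with the last state it saw. All names here ({{HaState}}, {{StateNotificationListener}}, {{ChangeTrackingListener}}) are hypothetical and are not taken from the patch or from HealthMonitor.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names for illustration only; not the HealthMonitor API from the patch.
enum HaState { INITIALIZING, ACTIVE, STANDBY }

/** Notification interface: invoked on every health check with the reported state. */
interface StateNotificationListener {
    void reportState(HaState current);
}

/** Implementation that remembers the last state and records only real changes. */
final class ChangeTrackingListener implements StateNotificationListener {
    private HaState last = HaState.INITIALIZING;
    private final List<String> changes = new ArrayList<>();

    @Override
    public synchronized void reportState(HaState current) {
        // The comparison lives in the implementation, not in the interface contract.
        if (current != last) {
            changes.add(last + "->" + current);
            last = current;
        }
    }

    /** Returns a copy of the recorded transitions, oldest first. */
    synchronized List<String> changes() {
        return new ArrayList<>(changes);
    }
}
```

With this split, the interface documentation only has to promise a periodic notification, while the change detection stays an implementation detail of the listener.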
[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable
[ https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969340#comment-13969340 ]

Vinayakumar B commented on HADOOP-10251:

Hi, can someone please review this patch? Thanks in advance.
[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable
[ https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13913016#comment-13913016 ]

Vinayakumar B commented on HADOOP-10251:

Hi, can someone please review this patch? Thanks.
[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable
[ https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890860#comment-13890860 ]

Vinay commented on HADOOP-10251:

Hi, can someone please review the patch? Thanks.
[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable
[ https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890924#comment-13890924 ]

Hadoop QA commented on HADOOP-10251:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12624583/HADOOP-10251.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:red}-1 javadoc{color}. The javadoc tool appears to have generated 2 warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/3528//testReport/
Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/3528//console

This message is automatically generated.
[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable
[ https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13878576#comment-13878576 ]

Vinay commented on HADOOP-10251:

The ZKFC health check reads the state of the NameNode, but it does not validate that state against the expected state.
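The missing validation described in this comment can be sketched as follows. This is only an illustration, assuming the controller records the state it believes it put the service into and compares it with what each health check reports. The names {{ServiceState}} and {{ExpectedStateChecker}} are hypothetical and do not come from ZKFailoverController.

```java
// Hypothetical names for illustration only; this is not the ZKFailoverController code.
enum ServiceState { ACTIVE, STANDBY }

/** Tracks the state the controller expects and flags disagreement with reported state. */
final class ExpectedStateChecker {
    private ServiceState expected;

    ExpectedStateChecker(ServiceState initial) {
        this.expected = initial;
    }

    /** Record the state the controller believes it has moved the service into. */
    synchronized void recordTransition(ServiceState newState) {
        expected = newState;
    }

    /** Returns true when a health check reports a state other than the expected one. */
    synchronized boolean isMismatch(ServiceState reported) {
        return reported != expected;
    }
}
```

In the reported scenario, the NN1 controller would have recorded ACTIVE while the NameNode itself reported STANDBY after being fenced, so {{isMismatch}} would return true and the controller could react, for example by rejoining the election.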
[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable
[ https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13878739#comment-13878739 ]

Hadoop QA commented on HADOOP-10251:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12624335/HADOOP-10251.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common:
org.apache.hadoop.ha.TestZKFailoverController
The following test timeouts occurred in hadoop-common-project/hadoop-common:
org.apache.hadoop.ha.TestZKFailoverControllerStress
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/3457//testReport/
Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/3457//console

This message is automatically generated.