[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable

2015-06-12 Thread Vinayakumar B (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14583083#comment-14583083
 ] 

Vinayakumar B commented on HADOOP-10251:


bq. Answer 2.when start only one ZKFC and NN ,the NN can be staying in ACTIVE 
for long time.
Yes, fine.

This happens because the patch from this issue was not merged properly into 
your code. In your source code, {{becomeActive()}} is missing the line 
{{serviceState = HAServiceState.ACTIVE;}}.

{code}
private synchronized void becomeActive() throws ServiceFailedException {
  LOG.info("Trying to make " + localTarget + " active...");
  try {
    HAServiceProtocolHelper.transitionToActive(localTarget.getProxy(conf,
        FailoverController.getRpcTimeoutToNewActive(conf)), createReqInfo());
    String msg = "Successfully transitioned " + localTarget + " to active state";
    LOG.info(msg);
    recordActiveAttempt(new ActiveAttemptRecord(true, msg));
  } catch (Throwable t) {
    String msg = "Couldn't make " + localTarget + " active";
    LOG.fatal(msg, t);
    recordActiveAttempt(new ActiveAttemptRecord(false, msg + "\n" +
        StringUtils.stringifyException(t)));
    if (t instanceof ServiceFailedException) {
      throw (ServiceFailedException) t;
    } else {
      throw new ServiceFailedException("Couldn't transition to active", t);
    }
    /*
     * TODO:
     * we need to make sure that if we get fenced and then quickly restarted,
     * none of these calls will retry across the restart boundary
     * perhaps the solution is that, whenever the nn starts, it gets a unique
     * ID, and when we start becoming active, we record it, and then any future
     * calls use the same ID
     */
  }
}
{code}


So if the previous state of the NameNode was not STANDBY, it will stay ACTIVE 
for a long time. But if it transitioned from STANDBY, it will continuously 
switch.

Add {{serviceState = HAServiceState.ACTIVE;}} in {{becomeActive()}} after 
{{LOG.info(msg);}} and everything will be fine.
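
The two behaviours described above can be reproduced with a small model. The sketch below is not the real ZKFailoverController: the class, field and method names are hypothetical, and it assumes only that the controller compares its cached state with the state the NameNode reports and gives up the election on a mismatch.

```java
// Simplified, hypothetical model of a controller that caches its NameNode's state.
class ZkfcStateSketch {
  enum State { INITIALIZING, ACTIVE, STANDBY }

  private State cachedState = State.INITIALIZING; // what the controller believes
  private final boolean recordActive;             // false models the mis-merged patch
  private int electionsQuit = 0;

  ZkfcStateSketch(boolean recordActive) {
    this.recordActive = recordActive;
  }

  void becomeStandby() {
    cachedState = State.STANDBY;
  }

  void becomeActive() {
    // The call asking the NameNode to become active would happen here.
    if (recordActive) {
      cachedState = State.ACTIVE; // the assignment missing from the unpatched code
    }
  }

  // Called on every health check with the state the NameNode itself reports.
  void verifyReportedState(State reported) {
    if (cachedState == State.INITIALIZING) {
      return; // nothing to compare against yet
    }
    if (reported != cachedState) {
      electionsQuit++; // mismatch: quit the election, which forces a failover
    }
  }

  // Makes the node active and counts how often the election is quit afterwards.
  static int run(boolean recordActive, boolean startFromStandby, int healthChecks) {
    ZkfcStateSketch zkfc = new ZkfcStateSketch(recordActive);
    if (startFromStandby) {
      zkfc.becomeStandby();
    }
    zkfc.becomeActive();
    for (int i = 0; i < healthChecks; i++) {
      zkfc.verifyReportedState(State.ACTIVE); // the NameNode really is active
    }
    return zkfc.electionsQuit;
  }

  public static void main(String[] args) {
    System.out.println("with the assignment, from STANDBY:         " + run(true, true, 3));
    System.out.println("without the assignment, from INITIALIZING: " + run(false, false, 3));
    System.out.println("without the assignment, from STANDBY:      " + run(false, true, 3));
  }
}
```

With three health checks the calls return 0, 0 and 3. Without the assignment, a node that was never STANDBY is left alone, while a node that came from STANDBY is judged inconsistent on every check.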

 Both NameNodes could be in STANDBY State if SNN network is unstable
 ---

 Key: HADOOP-10251
 URL: https://issues.apache.org/jira/browse/HADOOP-10251
 Project: Hadoop Common
  Issue Type: Bug
  Components: ha
Affects Versions: 2.2.0
Reporter: Vinayakumar B
Assignee: Vinayakumar B
Priority: Critical
 Fix For: 2.5.0

 Attachments: HADOOP-10251.patch, HADOOP-10251.patch, 
 HADOOP-10251.patch, HADOOP-10251.patch, HADOOP-10251.patch


 The following corner-case scenario happened in one of our clusters.
 1. NN1 was Active and NN2 was Standby.
 2. NN2 machine's network was slow.
 3. NN1 was shut down.
 4. NN2 ZKFC got the notification and tried to check for the old active for 
 fencing. (This took a little more time, again due to the slow network.)
 5. In between, NN1 was restarted by our automatic monitoring, and its ZKFC 
 made it Active.
 6. Now NN2 ZKFC got the old active as NN1 and gracefully fenced NN1 to 
 STANDBY.
 7. Before writing ActiveBreadCrumb to ZK, NN2 ZKFC got a session timeout and 
 was shut down before making NN2 Active.
 *Now the cluster has both NameNodes in STANDBY.*
 NN1 ZKFC still thinks that its NameNode is in Active state. 
 NN2 ZKFC is waiting for election.
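
The seven steps above can be replayed as a plain sequence of state changes. This is only an illustrative sketch: the `Node` type and its fields are hypothetical and no timing, ZooKeeper or RPC behaviour is modelled. It shows the end state, in which no NameNode is active while NN1's ZKFC still believes otherwise.

```java
// Hypothetical replay of the reported scenario; not Hadoop code.
class BothStandbySketch {
  enum State { ACTIVE, STANDBY, DOWN }

  // One NameNode plus what its own ZKFC believes about it.
  static final class Node {
    State actual;
    State zkfcView;

    Node(State initial) {
      actual = initial;
      zkfcView = initial;
    }
  }

  // Replays steps 1 to 7 and returns {nn1, nn2} in their final states.
  static Node[] replay() {
    Node nn1 = new Node(State.ACTIVE);   // step 1
    Node nn2 = new Node(State.STANDBY);
    // step 2: NN2's slow network only stretches the timing of what follows
    nn1.actual = State.DOWN;             // step 3: NN1 is shut down
    // step 4: NN2's ZKFC is notified and slowly looks up the old active
    nn1.actual = State.ACTIVE;           // step 5: NN1 is restarted and made active
    nn1.zkfcView = State.ACTIVE;
    nn1.actual = State.STANDBY;          // step 6: fenced by NN2's ZKFC, NN1's ZKFC is not told
    // step 7: NN2's ZKFC loses its session before making NN2 active,
    // so nn2 is never changed from STANDBY
    return new Node[] { nn1, nn2 };
  }

  // True when no NameNode in the array is actually active.
  static boolean noActive(Node[] nodes) {
    for (Node n : nodes) {
      if (n.actual == State.ACTIVE) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    Node[] nodes = replay();
    System.out.println("no active NameNode: " + noActive(nodes));
    System.out.println("NN1 actual=" + nodes[0].actual + ", NN1 ZKFC view=" + nodes[0].zkfcView);
    System.out.println("NN2 actual=" + nodes[1].actual);
  }
}
```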



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable

2015-06-12 Thread lvchuanwen (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14583029#comment-14583029
 ] 

lvchuanwen commented on HADOOP-10251:
-

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.hadoop.ha;

import java.io.IOException;
import java.net.InetSocketAddress;
import java.security.PrivilegedAction;
import java.security.PrivilegedExceptionAction;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.HadoopIllegalArgumentException;
import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ha.ActiveStandbyElector.ActiveNotFoundException;
import org.apache.hadoop.ha.ActiveStandbyElector.ActiveStandbyElectorCallback;
import org.apache.hadoop.ha.HAServiceProtocol.HAServiceState;
import org.apache.hadoop.ha.HAServiceProtocol.StateChangeRequestInfo;
import org.apache.hadoop.ha.HAServiceProtocol.RequestSource;
import org.apache.hadoop.util.ZKUtil;
import org.apache.hadoop.util.ZKUtil.ZKAuthInfo;
import org.apache.hadoop.ha.HealthMonitor.State;
import org.apache.hadoop.ipc.Server;
import org.apache.hadoop.security.AccessControlException;
import org.apache.hadoop.security.SecurityUtil;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.authorize.PolicyProvider;
import org.apache.hadoop.util.StringUtils;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.hadoop.util.ToolRunner;
import org.apache.zookeeper.data.ACL;

import com.google.common.annotations.VisibleForTesting;
import com.google.common.base.Preconditions;
import com.google.common.base.Throwables;
import com.google.common.util.concurrent.ThreadFactoryBuilder;


@InterfaceAudience.LimitedPrivate("HDFS")
public abstract class ZKFailoverController {

  static final Log LOG = LogFactory.getLog(ZKFailoverController.class);
  
  public static final String ZK_QUORUM_KEY = "ha.zookeeper.quorum";
  private static final String ZK_SESSION_TIMEOUT_KEY =
      "ha.zookeeper.session-timeout.ms";
  private static final int ZK_SESSION_TIMEOUT_DEFAULT = 5*1000;
  private static final String ZK_PARENT_ZNODE_KEY = "ha.zookeeper.parent-znode";
  public static final String ZK_ACL_KEY = "ha.zookeeper.acl";
  private static final String ZK_ACL_DEFAULT = "world:anyone:rwcda";
  public static final String ZK_AUTH_KEY = "ha.zookeeper.auth";
  static final String ZK_PARENT_ZNODE_DEFAULT = "/hadoop-ha";

  /**
   * All of the conf keys used by the ZKFC. This is used in order to allow
   * them to be overridden on a per-nameservice or per-namenode basis.
   */
  protected static final String[] ZKFC_CONF_KEYS = new String[] {
    ZK_QUORUM_KEY,
    ZK_SESSION_TIMEOUT_KEY,
    ZK_PARENT_ZNODE_KEY,
    ZK_ACL_KEY,
    ZK_AUTH_KEY
  };
  
  protected static final String USAGE =
      "Usage: java zkfc [ -formatZK [-force] [-nonInteractive] ]";

  /** Unable to format the parent znode in ZK */
  static final int ERR_CODE_FORMAT_DENIED = 2;
  /** The parent znode doesn't exist in ZK */
  static final int ERR_CODE_NO_PARENT_ZNODE = 3;
  /** Fencing is not properly configured */
  static final int ERR_CODE_NO_FENCER = 4;
  /** Automatic failover is not enabled */
  static final int ERR_CODE_AUTO_FAILOVER_NOT_ENABLED = 5;
  /** Cannot connect to ZooKeeper */
  static final int ERR_CODE_NO_ZK = 6;
  
  protected Configuration conf;
  private String zkQuorum;
  protected final HAServiceTarget localTarget;

  private HealthMonitor healthMonitor;
  private ActiveStandbyElector elector;
  protected ZKFCRpcServer rpcServer;

  private State lastHealthState = State.INITIALIZING;

  private volatile HAServiceState serviceState = HAServiceState.INITIALIZING;

  /** Set if a fatal error occurs */
  private String fatalError = null;

  /**
   * A future nanotime before which the ZKFC will not join the election.
   * This is used during graceful failover.
   */

[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable

2015-06-12 Thread lvchuanwen (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14583024#comment-14583024
 ] 

lvchuanwen commented on HADOOP-10251:
-

Answer 2: when I start only one ZKFC and NN, the NN can stay in ACTIVE for a 
long time.

hdfs-nn1-zkfc-host195.log

2015-06-12 07:18:11,471 INFO org.apache.hadoop.hdfs.tools.DFSZKFailoverController: Failover controller configured for NameNode NameNode at zdh195/10.43.156.195:9000
2015-06-12 07:18:11,608 INFO org.apache.zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.5-cdh5.3.2--1, built on 05/15/2015 03:44 GMT
2015-06-12 07:18:11,608 INFO org.apache.zookeeper.ZooKeeper: Client environment:host.name=zdh195
2015-06-12 07:18:11,608 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.version=1.7.0_55
2015-06-12 07:18:11,608 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation
2015-06-12 07:18:11,608 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.home=/usr/java/jdk/jre
2015-06-12 07:18:11,608 INFO org.apache.zookeeper.ZooKeeper: Client 

[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable

2015-06-12 Thread lvchuanwen (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14583120#comment-14583120
 ] 

lvchuanwen commented on HADOOP-10251:
-

Thank you very much. I was careless.



[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable

2015-06-11 Thread Vinayakumar B (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14581700#comment-14581700
 ] 

Vinayakumar B commented on HADOOP-10251:


There is no difference in the code between normal auto failover and failover 
using haadmin.

Is normal auto failover working for you?



[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable

2015-06-11 Thread lvchuanwen (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14581725#comment-14581725
 ] 

lvchuanwen commented on HADOOP-10251:
-

Auto failover works normally. When I kill the process of the active NameNode 
nn1, nn2 can transition to active.
I find that if the key ha.health-monitor.check-interval.ms is set to two 
different values, hdfs haadmin has no problem.
nn1 setting:
<property>
  <name>ha.health-monitor.check-interval.ms</name>
  <value>2000</value>
</property>
nn2 setting:
<property>
  <name>ha.health-monitor.check-interval.ms</name>
  <value>1000</value>
</property>
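
One way to confirm that two NameNodes really carry different values for a key is to parse both configuration fragments and compare the results. The sketch below uses only the JDK's XML parser. The class and method names are hypothetical, and a real deployment would more likely read the files through Hadoop's own configuration classes.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper that reads one property from a Hadoop-style XML fragment.
class ConfigValueSketch {
  static final String KEY = "ha.health-monitor.check-interval.ms";

  // Returns the value of the named property, or null if it is absent.
  static String valueOf(String xml, String key) {
    try {
      NodeList props = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
          .getElementsByTagName("property");
      for (int i = 0; i < props.getLength(); i++) {
        Element p = (Element) props.item(i);
        String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
        if (key.equals(name)) {
          return p.getElementsByTagName("value").item(0).getTextContent().trim();
        }
      }
      return null;
    } catch (Exception e) {
      throw new IllegalArgumentException("could not parse configuration fragment", e);
    }
  }

  // Builds a one-property fragment like the ones quoted in this comment.
  static String fragment(String value) {
    return "<configuration><property><name>" + KEY + "</name><value>" + value
        + "</value></property></configuration>";
  }

  public static void main(String[] args) {
    String nn1 = valueOf(fragment("2000"), KEY);
    String nn2 = valueOf(fragment("1000"), KEY);
    System.out.println("nn1=" + nn1 + " nn2=" + nn2 + " differ=" + !nn1.equals(nn2));
  }
}
```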




[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable

2015-06-11 Thread Vinayakumar B (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14581735#comment-14581735
 ] 

Vinayakumar B commented on HADOOP-10251:


bq. when i kill the process of active namenode nn1 ,nn2 can transition to 
active .
When you kill the active nn2, does nn1 transition to active?



[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable

2015-06-11 Thread lvchuanwen (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14581766#comment-14581766
 ] 

lvchuanwen commented on HADOOP-10251:
-

Yes. After starting nn1 with hadoop-daemon.sh start namenode and then killing 
nn2, nn1 can transition to active.



[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable

2015-06-11 Thread Vinayakumar B (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14581795#comment-14581795
 ] 

Vinayakumar B commented on HADOOP-10251:


I do not understand what the problem in your cluster is, especially since it 
occurs only with haadmin failover whereas auto failover works fine.

Can you share the ZKFC logs for auto failover?



[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable

2015-06-11 Thread lvchuanwen (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14581767#comment-14581767
 ] 

lvchuanwen commented on HADOOP-10251:
-

Yes. After running hadoop-daemon.sh start namenode on nn1 and then killing nn2, 
nn1 can transition to active.



[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable

2015-06-11 Thread Vinayakumar B (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14582914#comment-14582914
 ] 

Vinayakumar B commented on HADOOP-10251:


bq. 2015-06-12 02:26:02,608 ERROR org.apache.hadoop.ha.ZKFailoverController: 
Local service NameNode at zdh196/10.43.156.196:9000 has changed the 
serviceState to active. Expected was standby. Quitting election marking fencing 
necessary.
bq. 2015-06-12 02:27:56,878 ERROR org.apache.hadoop.ha.ZKFailoverController: 
Local service NameNode at zdh195/10.43.156.195:9000 has changed the 
serviceState to active. Expected was standby. Quitting election marking fencing 
necessary.

The use case mentioned is a normal automatic failover, not a manual failover 
using haadmin commands.
I also see that neither NN1 nor NN2 stays in the active state when the 
transition is from standby to active. This is strange.

Can you check the following?

1. Stop both ZKFCs and both NNs.
2. Start only one ZKFC and NN. It should successfully transition to active. 
Check whether it stays ACTIVE for a long time.
3. Attach the ZKFC logs after the restart (all lines from the restart point 
onward).
4. Also attach the ZKFailoverController.java source code you are using.



[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable

2015-06-11 Thread Vinayakumar B (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14581590#comment-14581590
 ] 

Vinayakumar B commented on HADOOP-10251:


Which version of Hadoop are you using?
I ask because I can see the logs below (DEBUG lines excluded):
{noformat}2015-06-10 02:57:56,073 INFO 
org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode 
at zdh195/10.43.156.195:9000 to active state
2015-06-10 02:57:56,092 INFO org.apache.hadoop.ha.ZKFailoverController: 
Successfully became active. Successfully transitioned NameNode at 
zdh195/10.43.156.195:9000 to active state
2015-06-10 02:57:57,082 ERROR org.apache.hadoop.ha.ZKFailoverController: Local 
service NameNode at zdh195/10.43.156.195:9000 has changed the serviceState to 
active. Expected was standby. Quitting election marking fencing 
necessary.{noformat}

Immediately after {{becomeActive()}}, the ERROR log shows that the expected 
state is {{standby}}, even though {{serviceState}} is changed to {{active}} in 
{{becomeActive()}} immediately after the log above.

IMO, this is possible only if {{volatile}} is missing from the declaration of 
{{serviceState}}:
{code}private volatile HAServiceState serviceState = 
HAServiceState.INITIALIZING;{code}

Do you have this in your code?
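
To make the check under discussion concrete, here is a minimal, self-contained sketch. It is not the ZKFailoverController source: the class `StateCheckSketch` and the methods `recordTransition` and `checkReportedState` are hypothetical names. Only the idea comes from the log above, namely that the controller keeps an expected {{serviceState}} and complains when the NameNode reports a different one.

```java
import java.util.Locale;

public class StateCheckSketch {
    enum HAServiceState { INITIALIZING, ACTIVE, STANDBY }

    // The state this controller believes it last put the NameNode into.
    private volatile HAServiceState serviceState = HAServiceState.INITIALIZING;

    /** Records the expected state, as a transition method would after a successful call. */
    void recordTransition(HAServiceState newState) {
        serviceState = newState;
    }

    /**
     * Compares the state reported by the NameNode with the expected state.
     * Returns true when they agree; false models quitting the election.
     */
    boolean checkReportedState(HAServiceState reported) {
        if (reported == serviceState) {
            return true;
        }
        System.out.println("Local service has changed the serviceState to "
                + reported.name().toLowerCase(Locale.ROOT) + ". Expected was "
                + serviceState.name().toLowerCase(Locale.ROOT) + ".");
        return false;
    }

    public static void main(String[] args) {
        // Expected state kept up to date: the check passes.
        StateCheckSketch upToDate = new StateCheckSketch();
        upToDate.recordTransition(HAServiceState.STANDBY);
        upToDate.recordTransition(HAServiceState.ACTIVE);
        System.out.println(upToDate.checkReportedState(HAServiceState.ACTIVE));

        // Expected state left at STANDBY although the NameNode became active:
        // the check fails with a message shaped like the ERROR line above.
        StateCheckSketch stale = new StateCheckSketch();
        stale.recordTransition(HAServiceState.STANDBY);
        System.out.println(stale.checkReportedState(HAServiceState.ACTIVE));
    }
}
```

Whatever leaves the expected value stale (a missed write or an update that is not visible to the checking thread), the symptom in this model is the same: the NameNode reports active while the controller still expects standby.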




[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable

2015-06-11 Thread lvchuanwen (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14581560#comment-14581560
 ] 

lvchuanwen commented on HADOOP-10251:
-

You can try the command hdfs haadmin -failover nn1 nn2, and then check whether 
the active node nn1 behaves normally.
nn1 keeps changing state: active - standby - active - standby ...
Sorry for my poor English; I hope you can understand. Thanks.



[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable

2015-06-11 Thread lvchuanwen (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14582800#comment-14582800
 ] 

lvchuanwen commented on HADOOP-10251:
-

Use cases:
1. NN1 was Active and NN2 was Standby. Kill NN1. NN2 transitions to active.
2. Run hadoop-daemon.sh start namenode NN2. Now NN1 is Standby and NN2 is Active.
3. Kill NN2. NN1 transitions to active.
Attaching the hdfs-nn1-zkfc-host195.log and hdfs-nn2-zkfc-host196.log files.




[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable

2015-06-11 Thread Xinglong.Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14582751#comment-14582751
 ] 

Xinglong.Li commented on HADOOP-10251:
--

Bump!



[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable

2015-06-11 Thread lvchuanwen (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14582802#comment-14582802
 ] 

lvchuanwen commented on HADOOP-10251:
-

hdfs-nn1-zkfc-host195.log

2015-06-12 02:25:38,799 WARN org.apache.hadoop.ha.HealthMonitor: 
Transport-level exception trying to monitor health of NameNode at 
zdh195/10.43.156.195:9000: Failed on local exception: java.io.EOFException; 
Host Details : local host is: zdh195/10.43.156.195; destination host is: 
zdh195:9000; 
2015-06-12 02:25:38,799 INFO org.apache.hadoop.ha.HealthMonitor: Entering state 
SERVICE_NOT_RESPONDING
2015-06-12 02:25:38,800 INFO org.apache.hadoop.ha.ZKFailoverController: Local 
service NameNode at zdh195/10.43.156.195:9000 entered state: 
SERVICE_NOT_RESPONDING
2015-06-12 02:25:38,800 INFO org.apache.hadoop.ha.ZKFailoverController: 
Quitting master election for NameNode at zdh195/10.43.156.195:9000 and marking 
that fencing is necessary
2015-06-12 02:25:38,800 INFO org.apache.hadoop.ha.ActiveStandbyElector: 
Yielding from election
2015-06-12 02:25:38,803 INFO org.apache.zookeeper.ZooKeeper: Session: 
0x24d91acb5cb1c33 closed
2015-06-12 02:25:38,803 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
Ignoring stale result from old client with sessionId 0x24d91acb5cb1c33
2015-06-12 02:25:38,803 INFO org.apache.zookeeper.ClientCnxn: EventThread shut 
down
2015-06-12 02:25:40,805 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: zdh195/10.43.156.195:9000. Already tried 0 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
2015-06-12 02:25:40,807 WARN org.apache.hadoop.ha.HealthMonitor: 
Transport-level exception trying to monitor health of NameNode at 
zdh195/10.43.156.195:9000: Call From zdh195/10.43.156.195 to zdh195:9000 failed 
on connection exception: java.net.ConnectException: Connection refused; For more details see: 
 http://wiki.apache.org/hadoop/ConnectionRefused
...
...
...
2015-06-12 02:27:22,976 WARN org.apache.hadoop.ha.HealthMonitor: 
Transport-level exception trying to monitor health of NameNode at 
zdh195/10.43.156.195:9000: Call From zdh195/10.43.156.195 to zdh195:9000 failed 
on connection exception: java.net.ConnectException: Connection refused; For more details see: 
 http://wiki.apache.org/hadoop/ConnectionRefused
2015-06-12 02:27:24,978 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: zdh195/10.43.156.195:9000. Already tried 0 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
2015-06-12 02:27:24,979 WARN org.apache.hadoop.ha.HealthMonitor: 
Transport-level exception trying to monitor health of NameNode at 
zdh195/10.43.156.195:9000: Call From zdh195/10.43.156.195 to zdh195:9000 failed 
on connection exception: java.net.ConnectException: Connection refused; For more details see: 
 http://wiki.apache.org/hadoop/ConnectionRefused
2015-06-12 02:27:26,981 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: zdh195/10.43.156.195:9000. Already tried 0 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
2015-06-12 02:27:27,810 INFO org.apache.hadoop.ha.HealthMonitor: Entering state 
SERVICE_HEALTHY
2015-06-12 02:27:27,810 INFO org.apache.hadoop.ha.ZKFailoverController: Local 
service NameNode at zdh195/10.43.156.195:9000 entered state: SERVICE_HEALTHY
2015-06-12 02:27:27,811 INFO org.apache.zookeeper.ZooKeeper: Initiating client 
connection, connectString=zdh196:2181,zdh195:2181,zdh197:2181 
sessionTimeout=1 
watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@651a6959
2015-06-12 02:27:27,812 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server zdh195/10.43.156.195:2181. Will not attempt to 
authenticate using SASL (unknown error)
2015-06-12 02:27:27,812 INFO org.apache.zookeeper.ClientCnxn: Socket connection 
established to zdh195/10.43.156.195:2181, initiating session
2015-06-12 02:27:27,814 INFO org.apache.zookeeper.ClientCnxn: Session 
establishment complete on server zdh195/10.43.156.195:2181, sessionid = 
0x24d91acb5cb1c51, negotiated timeout = 1
2015-06-12 02:27:27,815 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session 
connected.
2015-06-12 02:27:27,816 INFO org.apache.hadoop.ha.ZKFailoverController: ZK 
Election indicated that NameNode at zdh195/10.43.156.195:9000 should become 
standby
2015-06-12 02:27:27,824 INFO org.apache.hadoop.ha.ZKFailoverController: 
Successfully transitioned NameNode at zdh195/10.43.156.195:9000 to standby state
2015-06-12 02:27:48,662 INFO org.apache.hadoop.ha.ActiveStandbyElector: 
Checking for any old active which needs to be fenced...
2015-06-12 02:27:48,663 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old 
node exists: 0a0a636c757374657231393512036e6e321a067a646831393620a84628d33e
2015-06-12 02:27:48,665 INFO 

[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable

2015-06-11 Thread lvchuanwen (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14582806#comment-14582806
 ] 

lvchuanwen commented on HADOOP-10251:
-

hdfs-nn2-zkfc-host196.log file

2015-06-12 02:25:54,146 INFO org.apache.hadoop.ha.ActiveStandbyElector: 
Checking for any old active which needs to be fenced...
2015-06-12 02:25:54,147 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old 
node exists: 0a0a636c757374657231393512036e6e311a067a646831393520a84628d33e
2015-06-12 02:25:54,149 INFO org.apache.hadoop.ha.ZKFailoverController: Should 
fence: NameNode at zdh195/10.43.156.195:9000
2015-06-12 02:25:55,151 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: zdh195/10.43.156.195:9000. Already tried 0 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
2015-06-12 02:25:55,152 WARN org.apache.hadoop.ha.FailoverController: Unable to 
gracefully make NameNode at zdh195/10.43.156.195:9000 standby (unable to 
connect)
java.net.ConnectException: Call From zdh196/10.43.156.196 to zdh195:9000 failed 
on connection exception: java.net.ConnectException: Connection refused; For more details see: 
 http://wiki.apache.org/hadoop/ConnectionRefused
    at sun.reflect.GeneratedConstructorAccessor24.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
    at org.apache.hadoop.ipc.Client.call(Client.java:1415)
    at org.apache.hadoop.ipc.Client.call(Client.java:1364)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
    at com.sun.proxy.$Proxy12.transitionToStandby(Unknown Source)
    at org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.transitionToStandby(HAServiceProtocolClientSideTranslatorPB.java:112)
    at org.apache.hadoop.ha.FailoverController.tryGracefulFence(FailoverController.java:172)
    at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:516)
    at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:507)
    at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
    at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:894)
    at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:901)
    at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:800)
    at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:605)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:499)
Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:606)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:700)
    at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1463)
    at org.apache.hadoop.ipc.Client.call(Client.java:1382)
    ... 14 more
2015-06-12 02:25:55,153 INFO org.apache.hadoop.ha.NodeFencer: == Beginning 
Service Fencing Process... ==
2015-06-12 02:25:55,154 INFO org.apache.hadoop.ha.NodeFencer: Trying method 
1/2: org.apache.hadoop.ha.SshFenceByTcpPort(null)
2015-06-12 02:25:55,157 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Connecting 
to zdh195...
2015-06-12 02:25:55,157 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: 
Connecting to zdh195 port 22
2015-06-12 02:25:55,159 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: 
Connection established
2015-06-12 02:25:55,188 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: 
Remote version string: SSH-2.0-OpenSSH_5.3
2015-06-12 02:25:55,188 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Local 
version string: SSH-2.0-JSCH-0.1.42
2015-06-12 02:25:55,189 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: 
CheckCiphers: 

[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable

2015-06-11 Thread lvchuanwen (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14581665#comment-14581665
 ] 

lvchuanwen commented on HADOOP-10251:
-

The version is 2.5.0.
The {{volatile}} keyword is present.

 Both NameNodes could be in STANDBY State if SNN network is unstable
 ---

 Key: HADOOP-10251
 URL: https://issues.apache.org/jira/browse/HADOOP-10251
 Project: Hadoop Common
  Issue Type: Bug
  Components: ha
Affects Versions: 2.2.0
Reporter: Vinayakumar B
Assignee: Vinayakumar B
Priority: Critical
 Fix For: 2.5.0

 Attachments: HADOOP-10251.patch, HADOOP-10251.patch, 
 HADOOP-10251.patch, HADOOP-10251.patch, HADOOP-10251.patch


 The following corner-case scenario happened in one of our clusters.
 1. NN1 was Active and NN2 was Standby.
 2. The NN2 machine's network was slow.
 3. NN1 got shut down.
 4. NN2's ZKFC got the notification and tried to check for an old active to
 fence. (This took a little more time, again due to the slow network.)
 5. In between, NN1 got restarted by our automatic monitoring, and its ZKFC
 made it Active.
 6. Now NN2's ZKFC got NN1 as the old active and gracefully fenced NN1 to
 STANDBY.
 7. Before writing the ActiveBreadCrumb to ZK, NN2's ZKFC got a session timeout
 and shut down before making NN2 Active.
 *Now the cluster has both NameNodes in STANDBY.*
 NN1's ZKFC still thinks that its NameNode is in the Active state.
 NN2's ZKFC is waiting for election.
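
For readers who want to step through the interleaving, here is a minimal toy replay of the seven steps above. It is not Hadoop code: the class {{BothStandbyScenario}}, the {{Node}} type and its fields are hypothetical names made up for illustration, and the model only tracks each NameNode's real state next to what its local ZKFC believes.

```java
/** Toy replay of the interleaving described above; this is not Hadoop code. */
class BothStandbyScenario {
    enum State { ACTIVE, STANDBY, STOPPED }

    /** Hypothetical pairing of a NameNode's real state with its ZKFC's belief. */
    static final class Node {
        final String name;
        State actual;       // what the NameNode process really is
        State zkfcBelief;   // what the local ZKFC last recorded

        Node(String name, State initial) {
            this.name = name;
            this.actual = initial;
            this.zkfcBelief = initial;
        }
    }

    /** Replays steps 1-7 in order and returns both nodes in their final state. */
    static Node[] replay() {
        Node nn1 = new Node("NN1", State.ACTIVE);    // step 1
        Node nn2 = new Node("NN2", State.STANDBY);
        nn1.actual = State.STOPPED;                   // step 3: NN1 shut down
        // step 4: NN2's ZKFC starts its (slow) lookup of the old active
        nn1.actual = State.ACTIVE;                    // step 5: NN1 restarted and
        nn1.zkfcBelief = State.ACTIVE;                //         made active by its ZKFC
        nn1.actual = State.STANDBY;                   // step 6: fenced by NN2's ZKFC;
                                                      //         NN1's ZKFC is not told
        // step 7: NN2's ZKFC loses its session before promoting NN2,
        //         so nn2.actual is never changed
        return new Node[] { nn1, nn2 };
    }

    /** True when no NameNode is actually active. */
    static boolean noActive(Node[] nodes) {
        for (Node n : nodes) {
            if (n.actual == State.ACTIVE) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Node[] nodes = replay();
        for (Node n : nodes) {
            System.out.println(n.name + " actual=" + n.actual
                + " zkfcBelief=" + n.zkfcBelief);
        }
        System.out.println("no active NameNode: " + noActive(nodes));
    }
}
```

The point of keeping {{actual}} and {{zkfcBelief}} as separate fields is that the bug is exactly the gap between them: NN1 ends up in standby while its ZKFC still records it as active.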



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable

2015-06-09 Thread Vinayakumar B (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578728#comment-14578728
 ] 

Vinayakumar B commented on HADOOP-10251:


bq. After removing the code // healthMonitor.addServiceStateCallback(new 
ServiceStateCallBacks()); failover command recovery normal
I did not understand the problem. Can you please elaborate?



[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable

2015-06-09 Thread lvchuanwen (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579884#comment-14579884
 ] 

lvchuanwen commented on HADOOP-10251:
-

Attaching some logs:

2015-06-10 02:57:53,727 DEBUG org.apache.hadoop.ipc.Server: Successfully authorized userInfo {
  effectiveUser: hdfs
}
protocol: org.apache.hadoop.ha.ZKFCProtocol

2015-06-10 02:57:53,728 DEBUG org.apache.hadoop.ipc.Server:  got #0
2015-06-10 02:57:53,728 DEBUG org.apache.hadoop.ipc.Server: IPC Server handler 2 on 8019: org.apache.hadoop.ha.ZKFCProtocol.gracefulFailover from 10.43.156.196:49132 Call#0 Retry#0 for RpcKind RPC_PROTOCOL_BUFFER
2015-06-10 02:57:53,730 DEBUG org.apache.hadoop.security.UserGroupInformation: PrivilegedAction as:hdfs (auth:SIMPLE) from:org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
2015-06-10 02:57:53,755 DEBUG org.apache.hadoop.security.Groups: Returning fetched groups for 'hdfs'
2015-06-10 02:57:53,755 INFO org.apache.hadoop.hdfs.tools.DFSZKFailoverController: Allowed RPC access from hdfs (auth:SIMPLE) at 10.43.156.196
2015-06-10 02:57:53,756 DEBUG org.apache.hadoop.security.UserGroupInformation: PrivilegedAction as:hdfs (auth:SIMPLE) from:org.apache.hadoop.ha.ZKFailoverController.gracefulFailoverToYou(ZKFailoverController.java:603)
2015-06-10 02:57:53,760 DEBUG org.apache.zookeeper.ClientCnxn: Reading reply sessionid:0x34dcf74b50a05d7, packet:: clientPath:null serverPath:null finished:false header:: 4,4  replyHeader:: 4,38654797504,0  request:: '/hadoop-ha/cluster195/ActiveStandbyElectorLock,F  response:: #aa636c75737465723139351236e6e321a67a646831393620ffa84628ffd33e,s{38654797500,38654797500,1433876199859,1433876199859,0,0,0,93891338261633392,31,0,38654797500}
2015-06-10 02:57:53,768 DEBUG org.apache.hadoop.hdfs.server.namenode.NameNode: Setting fs.defaultFS to hdfs://zdh196:9000
2015-06-10 02:57:53,770 INFO org.apache.hadoop.ha.ZKFailoverController: Asking NameNode at zdh196/10.43.156.196:9000 to cede its active state for 1ms
2015-06-10 02:57:53,772 DEBUG org.apache.hadoop.ipc.Client: getting client out of cache: org.apache.hadoop.ipc.Client@7007cf85
2015-06-10 02:57:53,779 DEBUG org.apache.hadoop.ipc.Client: The ping interval is 6 ms.
2015-06-10 02:57:53,779 DEBUG org.apache.hadoop.ipc.Client: Connecting to zdh196/10.43.156.196:8019
2015-06-10 02:57:53,781 DEBUG org.apache.hadoop.ipc.Client: IPC Client (256152889) connection to zdh196/10.43.156.196:8019 from hdfs: starting, having connections 2
2015-06-10 02:57:53,781 DEBUG org.apache.hadoop.ipc.Client: IPC Client (256152889) connection to zdh196/10.43.156.196:8019 from hdfs sending #147
2015-06-10 02:57:53,969 DEBUG org.apache.zookeeper.ClientCnxn: Got notification sessionid:0x34dcf74b50a05d7
2015-06-10 02:57:53,970 DEBUG org.apache.zookeeper.ClientCnxn: Got WatchedEvent state:SyncConnected type:NodeDeleted path:/hadoop-ha/cluster195/ActiveStandbyElectorLock for sessionid 0x34dcf74b50a05d7
2015-06-10 02:57:53,974 DEBUG org.apache.hadoop.ha.ActiveStandbyElector: Watcher event type: NodeDeleted with state:SyncConnected for path:/hadoop-ha/cluster195/ActiveStandbyElectorLock connectionState: CONNECTED for elector id=1893177739 appData=0a0a636c757374657231393512036e6e311a067a646831393520a84628d33e cb=Elector callbacks for NameNode at zdh195/10.43.156.195:9000
2015-06-10 02:57:53,979 DEBUG org.apache.hadoop.ipc.Client: IPC Client (256152889) connection to zdh196/10.43.156.196:8019 from hdfs got value #147
2015-06-10 02:57:53,979 DEBUG org.apache.zookeeper.ClientCnxn: Reading reply sessionid:0x34dcf74b50a05d7, packet:: clientPath:/hadoop-ha/cluster195/ActiveStandbyElectorLock serverPath:/hadoop-ha/cluster195/ActiveStandbyElectorLock finished:false header:: 5,1  replyHeader:: 5,38654797507,0  request:: '/hadoop-ha/cluster195/ActiveStandbyElectorLock,#aa636c75737465723139351236e6e311a67a646831393520ffa84628ffd33e,v{s{31,s{'world,'anyone}}},1  response:: '/hadoop-ha/cluster195/ActiveStandbyElectorLock
2015-06-10 02:57:53,979 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine: Call: cedeActive took 200ms
2015-06-10 02:57:53,982 DEBUG org.apache.hadoop.ha.ActiveStandbyElector: CreateNode result: 0 for path: /hadoop-ha/cluster195/ActiveStandbyElectorLock connectionState: CONNECTED  for elector id=1893177739 appData=0a0a636c757374657231393512036e6e311a067a646831393520a84628d33e cb=Elector callbacks for NameNode at zdh195/10.43.156.195:9000
2015-06-10 02:57:53,982 INFO org.apache.hadoop.ha.ActiveStandbyElector: Checking for any old active which needs to be fenced...
2015-06-10 02:57:53,985 DEBUG org.apache.zookeeper.ClientCnxn: Reading reply sessionid:0x34dcf74b50a05d7, packet:: clientPath:null serverPath:null finished:false header:: 6,4  replyHeader:: 6,38654797507,-101  request:: '/hadoop-ha/cluster195/ActiveBreadCrumb,F  response::
2015-06-10 02:57:53,996 INFO 

[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable

2015-06-08 Thread lvchuanwen (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578184#comment-14578184
 ] 

lvchuanwen commented on HADOOP-10251:
-

Hi, nn1 and nn2 keep alternating into the active state whenever I run
{{hdfs haadmin -failover nn1 nn2}}.
After commenting out the line
{{healthMonitor.addServiceStateCallback(new ServiceStateCallBacks());}}
the failover command works normally again.

Thank you.



[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable

2014-04-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13981004#comment-13981004
 ] 

Hudson commented on HADOOP-10251:
-

FAILURE: Integrated in Hadoop-Hdfs-trunk #1742 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1742/])
HADOOP-10251. Both NameNodes could be in STANDBY State if SNN network is 
unstable. Contributed by Vinayakumar B. (umamahesh: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1589494)
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/HealthMonitor.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ZKFailoverController.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ha/TestZKFailoverController.java




[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable

2014-04-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13979592#comment-13979592
 ] 

Hudson commented on HADOOP-10251:
-

ABORTED: Integrated in Hadoop-Yarn-trunk #550 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/550/])
HADOOP-10251. Both NameNodes could be in STANDBY State if SNN network is 
unstable. Contributed by Vinayakumar B. (umamahesh: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1589494)
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/HealthMonitor.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ZKFailoverController.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ha/TestZKFailoverController.java




[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable

2014-04-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13979710#comment-13979710
 ] 

Hudson commented on HADOOP-10251:
-

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1767 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1767/])
HADOOP-10251. Both NameNodes could be in STANDBY State if SNN network is 
unstable. Contributed by Vinayakumar B. (umamahesh: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1589494)
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/HealthMonitor.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ZKFailoverController.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ha/TestZKFailoverController.java




[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable

2014-04-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13978796#comment-13978796
 ] 

Hudson commented on HADOOP-10251:
-

SUCCESS: Integrated in Hadoop-trunk-Commit #5554 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/5554/])
HADOOP-10251. Both NameNodes could be in STANDBY State if SNN network is 
unstable. Contributed by Vinayakumar B. (umamahesh: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1589494)
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/HealthMonitor.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ZKFailoverController.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ha/TestZKFailoverController.java




[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable

2014-04-22 Thread Uma Maheswara Rao G (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13977851#comment-13977851
 ] 

Uma Maheswara Rao G commented on HADOOP-10251:
--

+1 on the latest patch. I will commit it shortly.



[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable

2014-04-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13973893#comment-13973893
 ] 

Hadoop QA commented on HADOOP-10251:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12640785/HADOOP-10251.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/3809//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/3809//console

This message is automatically generated.



[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable

2014-04-17 Thread Uma Maheswara Rao G (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13973090#comment-13973090
 ] 

Uma Maheswara Rao G commented on HADOOP-10251:
--

I think your idea will work, and the patch looks almost good to me.

One question:
{code}
 /**
+   * Callback interface for service state change events.
+   * 
+   * This interface is called whenever there is a change in the service state.
+   */
{code}
It seems this interface will be called on every health monitor status check, 
but the doc says it is a service state change event. That exposes 
implementation details: the state comparison, and the necessary actions on a 
state change, happen in the implementation. So at the interface level we 
cannot be sure whether the service state has changed since the last check, 
right? Can you update this to something like "Service state notification"?
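
To make the distinction concrete, here is a small sketch of a callback that is invoked on every check, with the change detection living in the implementation. All names here ({{ServiceStateNotificationSketch}}, {{HAState}}, {{ChangeCounter}}) are hypothetical and chosen for illustration; this mirrors the idea under review and is not taken from the patch.

```java
/** Hypothetical names throughout; illustrates the idea, not the actual patch. */
class ServiceStateNotificationSketch {
    enum HAState { ACTIVE, STANDBY }

    /**
     * Service state notification: invoked with the state observed on every
     * health check, whether or not it differs from the previous one.
     */
    interface ServiceStateCallback {
        void reportServiceStatus(HAState observed);
    }

    /** Change detection is done here, on the implementation side. */
    static final class ChangeCounter implements ServiceStateCallback {
        private HAState last;   // null until the first report arrives
        private int changes;    // number of real transitions seen so far

        @Override
        public void reportServiceStatus(HAState observed) {
            if (last != null && last != observed) {
                changes++;
            }
            last = observed;
        }

        int changes() {
            return changes;
        }
    }

    public static void main(String[] args) {
        ChangeCounter counter = new ChangeCounter();
        HAState[] reports = {
            HAState.STANDBY, HAState.STANDBY, HAState.ACTIVE,
            HAState.ACTIVE, HAState.STANDBY
        };
        for (HAState s : reports) {
            counter.reportServiceStatus(s);   // one call per health check
        }
        System.out.println("reports=" + reports.length
            + " changes=" + counter.changes());
    }
}
```

With this split, the interface only promises "you will be told the current state", which matches a "notification" wording; whether anything changed is a decision each implementation makes for itself.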



[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable

2014-04-15 Thread Vinayakumar B (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13969340#comment-13969340
 ] 

Vinayakumar B commented on HADOOP-10251:


Hi,
Could someone please review this patch?
Thanks in advance.



[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable

2014-02-26 Thread Vinayakumar B (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13913016#comment-13913016
 ] 

Vinayakumar B commented on HADOOP-10251:


Hi,
Could someone please review this patch?
Thanks.



[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable

2014-02-04 Thread Vinay (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13890860#comment-13890860
 ] 

Vinay commented on HADOOP-10251:


Hi, can someone please review the patch? Thanks.



[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable

2014-02-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13890924#comment-13890924
 ] 

Hadoop QA commented on HADOOP-10251:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12624583/HADOOP-10251.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 2 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/3528//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/3528//console

This message is automatically generated.

 Both NameNodes could be in STANDBY State if SNN network is unstable
 ---

 Key: HADOOP-10251
 URL: https://issues.apache.org/jira/browse/HADOOP-10251
 Project: Hadoop Common
  Issue Type: Bug
  Components: ha
Affects Versions: 2.2.0
Reporter: Vinay
Assignee: Vinay
Priority: Critical
 Attachments: HADOOP-10251.patch, HADOOP-10251.patch, 
 HADOOP-10251.patch, HADOOP-10251.patch







[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable

2014-01-22 Thread Vinay (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13878576#comment-13878576
 ] 

Vinay commented on HADOOP-10251:


The ZKFC health check checks the state of the NameNode, but it does not validate 
it against the expected state.
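
A minimal sketch of what such a validation could look like, assuming the ZKFC 
remembers the state it last set on its NameNode. Everything here is 
hypothetical and for illustration only: {{ExpectedStateCheckSketch}}, 
{{HAState}}, {{matchesExpected()}} and {{onHealthCheck()}} are made-up names, 
and the reaction to a mismatch is just one possible choice, not necessarily 
what the patch does.

```java
/** Illustrative only: compares the reported NameNode state with the expected one. */
public class ExpectedStateCheckSketch {
    // Hypothetical stand-in for the HA states a NameNode can report.
    enum HAState { INITIALIZING, ACTIVE, STANDBY }

    /** True when the state reported by the NameNode is the one the ZKFC last set. */
    static boolean matchesExpected(HAState expectedByZkfc, HAState reportedByNameNode) {
        return expectedByZkfc == reportedByNameNode;
    }

    /** Hypothetical health-check hook: returns a description of the action to take. */
    static String onHealthCheck(HAState expectedByZkfc, HAState reportedByNameNode) {
        if (matchesExpected(expectedByZkfc, reportedByNameNode)) {
            return "OK";
        }
        // Assumption: on a mismatch the ZKFC gives up its current view and
        // rejoins the election instead of trusting its stale state.
        return "MISMATCH: expected " + expectedByZkfc
                + " but NameNode reports " + reportedByNameNode
                + "; rejoin election";
    }

    public static void main(String[] args) {
        // The situation from the report: the ZKFC expects ACTIVE,
        // but NN1 was fenced to STANDBY by the other ZKFC.
        System.out.println(onHealthCheck(HAState.ACTIVE, HAState.STANDBY));
        System.out.println(onHealthCheck(HAState.STANDBY, HAState.STANDBY));
    }
}
```

With a check like this, NN1's ZKFC would notice on its next health check that 
its NameNode is no longer in the state it expects.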


 Both NameNodes could be in STANDBY State if SNN network is unstable
 ---

 Key: HADOOP-10251
 URL: https://issues.apache.org/jira/browse/HADOOP-10251
 Project: Hadoop Common
  Issue Type: Bug
  Components: ha
Affects Versions: 2.2.0
Reporter: Vinay
Assignee: Vinay
Priority: Critical






[jira] [Commented] (HADOOP-10251) Both NameNodes could be in STANDBY State if SNN network is unstable

2014-01-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13878739#comment-13878739
 ] 

Hadoop QA commented on HADOOP-10251:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12624335/HADOOP-10251.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-common-project/hadoop-common:

  org.apache.hadoop.ha.TestZKFailoverController

  The following test timeouts occurred in 
hadoop-common-project/hadoop-common:

org.apache.hadoop.ha.TestZKFailoverControllerStress

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/3457//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/3457//console

This message is automatically generated.

 Both NameNodes could be in STANDBY State if SNN network is unstable
 ---

 Key: HADOOP-10251
 URL: https://issues.apache.org/jira/browse/HADOOP-10251
 Project: Hadoop Common
  Issue Type: Bug
  Components: ha
Affects Versions: 2.2.0
Reporter: Vinay
Assignee: Vinay
Priority: Critical
 Attachments: HADOOP-10251.patch




