[ 
https://issues.apache.org/jira/browse/HADOOP-12680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15071414#comment-15071414
 ] 

Lin Yiqun commented on HADOOP-12680:
------------------------------------

Here are some of the ZKFC logs for this case, first from qihe2192 and then from qihe2182:
{code}
2015-12-24 17:33:43,873 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session 
expired. Entering neutral mode and rejoining...
2015-12-24 17:33:43,873 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying 
to re-establish ZK session
2015-12-24 17:33:43,875 INFO org.apache.zookeeper.ZooKeeper: Initiating client 
connection, 
connectString=10.13.8.24:2181,10.13.8.25:2181,10.13.8.26:2181,10.13.8.27:2181,10.13.7.33:2181
 sessionTimeout=30000 
watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@56d70b02
2015-12-24 17:33:43,884 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server 10.13.8.25/10.13.8.25:2181. Will not attempt to 
authenticate using SASL (unknown error)
2015-12-24 17:33:43,884 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for 
server null, unexpected error, closing socket connection and attempting 
reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
        at 
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2015-12-24 17:33:43,905 INFO org.apache.zookeeper.ClientCnxn: Unable to 
reconnect to ZooKeeper service, session 0x451703dcdf7d107 has expired, closing 
socket connection
2015-12-24 17:33:43,985 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server 10.13.7.33/10.13.7.33:2181. Will not attempt to 
authenticate using SASL (unknown error)
2015-12-24 17:33:43,985 INFO org.apache.zookeeper.ClientCnxn: Socket connection 
established to 10.13.7.33/10.13.7.33:2181, initiating session
2015-12-24 17:33:43,985 INFO org.apache.zookeeper.ClientCnxn: Unable to read 
additional data from server sessionid 0x0, likely server has closed socket, 
closing socket connection and attempting reconnect
2015-12-24 17:33:44,712 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server 10.13.8.24/10.13.8.24:2181. Will not attempt to 
authenticate using SASL (unknown error)
2015-12-24 17:33:44,712 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for 
server null, unexpected error, closing socket connection and attempting 
reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
        at 
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2015-12-24 17:33:45,806 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server 10.13.8.26/10.13.8.26:2181. Will not attempt to 
authenticate using SASL (unknown error)
2015-12-24 17:33:45,807 INFO org.apache.zookeeper.ClientCnxn: Socket connection 
established to 10.13.8.26/10.13.8.26:2181, initiating session
2015-12-24 17:33:45,807 INFO org.apache.zookeeper.ClientCnxn: Unable to read 
additional data from server sessionid 0x0, likely server has closed socket, 
closing socket connection and attempting reconnect
2015-12-24 17:33:46,549 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server 10.13.8.27/10.13.8.27:2181. Will not attempt to 
authenticate using SASL (unknown error)
2015-12-24 17:33:46,550 INFO org.apache.zookeeper.ClientCnxn: Socket connection 
established to 10.13.8.27/10.13.8.27:2181, initiating session
2015-12-24 17:33:46,561 INFO org.apache.zookeeper.ClientCnxn: Session 
establishment complete on server 10.13.8.27/10.13.8.27:2181, sessionid = 
0x451d35639b5002a, negotiated timeout = 30000
2015-12-24 17:33:46,563 INFO org.apache.zookeeper.ClientCnxn: EventThread shut 
down
2015-12-24 17:33:46,564 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session 
connected.
2015-12-24 17:33:46,573 INFO org.apache.hadoop.ha.ZKFailoverController: ZK 
Election indicated that NameNode at qihe2192/10.12.2.192:9000 should become 
standby
2015-12-24 17:33:46,575 INFO org.apache.hadoop.ha.ZKFailoverController: 
Successfully transitioned NameNode at qihe2192/10.12.2.192:9000 to standby state
2015-12-24 17:47:21,517 WARN org.apache.hadoop.ha.HealthMonitor: 
Transport-level exception trying to monitor health of NameNode at 
qihe2192/10.12.2.192:9000: java.io.IOException: Connection reset by peer Failed 
on local exception: java.io.IOException: Connection reset by peer; Host Details 
: local host is: "qihe2192/10.12.2.192"; destination host is: "qihe2192":9000;
{code}
{code}
2015-12-24 17:33:44,860 INFO org.apache.zookeeper.ClientCnxn: Unable to 
reconnect to ZooKeeper service, session 0x551703eef8b00c2 has expired, closing 
socket connection
2015-12-24 17:33:44,861 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session 
expired. Entering neutral mode and rejoining...
2015-12-24 17:33:44,861 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying 
to re-establish ZK session
2015-12-24 17:33:44,862 INFO org.apache.zookeeper.ZooKeeper: Initiating client 
connection, 
connectString=10.13.8.24:2181,10.13.8.25:2181,10.13.8.26:2181,10.13.8.27:2181,10.13.7.33:2181
 sessionTimeout=30000 
watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@5eefe70b
2015-12-24 17:33:44,863 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server 10.13.8.27/10.13.8.27:2181. Will not attempt to 
authenticate using SASL (unknown error)
2015-12-24 17:33:44,863 INFO org.apache.zookeeper.ClientCnxn: Socket connection 
established to 10.13.8.27/10.13.8.27:2181, initiating session
2015-12-24 17:33:44,871 INFO org.apache.zookeeper.ClientCnxn: Session 
establishment complete on server 10.13.8.27/10.13.8.27:2181, sessionid = 
0x451d35639b50012, negotiated timeout = 30000
2015-12-24 17:33:44,873 INFO org.apache.zookeeper.ClientCnxn: EventThread shut 
down
2015-12-24 17:33:44,874 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session 
connected.
2015-12-24 17:33:44,892 INFO org.apache.hadoop.ha.ZKFailoverController: ZK 
Election indicated that NameNode at qihe2182/10.12.2.182:9000 should become 
standby
2015-12-24 17:33:44,928 INFO org.apache.hadoop.ha.ZKFailoverController: 
Successfully transitioned NameNode at qihe2182/10.12.2.182:9000 to standby state
2015-12-24 17:47:20,883 WARN org.apache.hadoop.ha.HealthMonitor: 
Transport-level exception trying to monitor health of NameNode at 
qihe2182/10.12.2.182:9000: java.io.IOException: Connection reset by peer Failed 
on local exception: java.io.IOException: Connection reset by peer; Host Details 
: local host is: "qihe2182/10.12.2.182"; destination host is: "qihe2182":9000;
2015-12-24 17:47:20,883 INFO org.apache.hadoop.ha.HealthMonitor: Entering state 
SERVICE_NOT_RESPONDING
{code}
At {{2015-12-24 17:33}}, both NameNodes were transitioned to standby state.
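
The issue description quoted below suggests a thread that monitors the NameNode states. A minimal sketch of that idea might look like the following. It is only an illustration, not a patch: {{NodeState}}, {{AllStandbyMonitor}}, the probe suppliers, and the callback are all hypothetical names, and a real implementation would presumably query each NameNode through the existing HA interfaces instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical HA states used only for this sketch. */
enum NodeState { ACTIVE, STANDBY, UNKNOWN }

/** Hypothetical watchdog that reports when every NameNode is standby. */
final class AllStandbyMonitor {
    // One probe per NameNode; each returns that node's current state (assumed).
    private final List<Supplier<NodeState>> probes;
    // Action to take when no node is active, e.g. log or trigger an election (assumed).
    private final Runnable onAllStandby;

    AllStandbyMonitor(List<Supplier<NodeState>> probes, Runnable onAllStandby) {
        this.probes = probes;
        this.onAllStandby = onAllStandby;
    }

    /** True only if there is at least one node and every node reported STANDBY. */
    static boolean allStandby(List<NodeState> states) {
        return !states.isEmpty()
            && states.stream().allMatch(s -> s == NodeState.STANDBY);
    }

    /** Probes every node once; runs the callback if all of them are standby. */
    boolean checkOnce() {
        List<NodeState> states = new ArrayList<>();
        for (Supplier<NodeState> probe : probes) {
            try {
                states.add(probe.get());
            } catch (RuntimeException e) {
                // An unreachable node is not known to be standby, so stay conservative.
                states.add(NodeState.UNKNOWN);
            }
        }
        boolean stuck = allStandby(states);
        if (stuck) {
            onAllStandby.run();
        }
        return stuck;
    }

    /** Starts a background thread that repeats the check with a fixed delay. */
    ScheduledExecutorService start(long periodSeconds) {
        ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor();
        executor.scheduleWithFixedDelay(
            () -> checkOnce(), periodSeconds, periodSeconds, TimeUnit.SECONDS);
        return executor;
    }
}
```

The sketch treats a node whose probe fails as {{UNKNOWN}} rather than standby, so the callback fires only when every NameNode has positively reported standby, as both did in the logs above.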

> Loss of zookeeper quorum lead all the namenode to be standby state
> ------------------------------------------------------------------
>
>                 Key: HADOOP-12680
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12680
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.7.1
>            Reporter: Lin Yiqun
>
> While upgrading my ZooKeeper cluster, I changed the IP addresses of the 
> ZK nodes. Both NameNodes of my Hadoop cluster then lost their connection to 
> ZK. After I recovered the ZK cluster, both NameNodes transitioned to standby 
> state, which left the cluster unable to provide service. I think the reason 
> may be the following:
> {code}
>     /**
>      * If the elector gets disconnected from Zookeeper and does not know about
>      * the lock state, then it will notify the service via the enterNeutralMode
>      * interface. The service may choose to ignore this or stop doing state
>      * changing operations. Upon reconnection, the elector verifies the leader
>      * status and calls back on the becomeActive and becomeStandby app
>      * interfaces. <br/>
>      * Zookeeper disconnects can happen due to network issues or loss of
>      * Zookeeper quorum. Thus enterNeutralMode can be used to guard against
>      * split-brain issues. In such situations it might be prudent to call
>      * becomeStandby too. However, such state change operations might be
>      * expensive and enterNeutralMode can help guard against doing that for
>      * transient issues.
>      */
>     void enterNeutralMode();
> {code}
> Maybe we should create a thread that monitors the state of the NameNodes and 
> prevents all of them from being in standby state at the same time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
