Y. SREENIVASULU REDDY created HBASE-14498:
---------------------------------------------
Summary: Master stuck in infinite loop when all Zookeeper servers
are unreachable.
Key: HBASE-14498
URL: https://issues.apache.org/jira/browse/HBASE-14498
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 1.0.1
Reporter: Y. SREENIVASULU REDDY
Priority: Blocker
We hit an unusual scenario in our production environment.
In an HA cluster:
> The active master (HM1) could not connect to any ZooKeeper server (due to a
> network breakdown between the master machine and the ZooKeeper servers).
{code}
2015-09-26 15:24:47,508 INFO
[HM1-Host:16000.activeMasterManager-SendThread(ZK-Host:2181)]
zookeeper.ClientCnxn: Client session timed out, have not heard from server in
33463ms for sessionid 0x104576b8dda0002, closing socket connection and
attempting reconnect
2015-09-26 15:24:47,877 INFO
[HM1-Host:16000.activeMasterManager-SendThread(ZK-Host1:2181)]
client.FourLetterWordMain: connecting to ZK-Host1 2181
2015-09-26 15:24:48,236 INFO [main-SendThread(ZK-Host1:2181)]
client.FourLetterWordMain: connecting to ZK-Host1 2181
2015-09-26 15:24:49,879 WARN
[HM1-Host:16000.activeMasterManager-SendThread(ZK-Host1:2181)]
zookeeper.ClientCnxn: Can not get the principle name from server ZK-Host1
2015-09-26 15:24:49,879 INFO
[HM1-Host:16000.activeMasterManager-SendThread(ZK-Host1:2181)]
zookeeper.ClientCnxn: Opening socket connection to server ZK-Host1/ZK-IP1:2181.
Will not attempt to authenticate using SASL (unknown error)
2015-09-26 15:24:50,238 WARN [main-SendThread(ZK-Host1:2181)]
zookeeper.ClientCnxn: Can not get the principle name from server ZK-Host1
2015-09-26 15:24:50,238 INFO [main-SendThread(ZK-Host1:2181)]
zookeeper.ClientCnxn: Opening socket connection to server
ZK-Host1/ZK-Host1:2181. Will not attempt to authenticate using SASL (unknown
error)
2015-09-26 15:25:17,470 INFO [main-SendThread(ZK-Host1:2181)]
zookeeper.ClientCnxn: Client session timed out, have not heard from server in
30023ms for sessionid 0x2045762cc710006, closing socket connection and
attempting reconnect
2015-09-26 15:25:17,571 WARN [master/HM1-Host/HM1-IP:16000]
zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper,
quorum=ZK-Host:2181,ZK-Host1:2181,ZK-Host2:2181,
exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /hbase/master
2015-09-26 15:25:17,872 INFO [main-SendThread(ZK-Host:2181)]
client.FourLetterWordMain: connecting to ZK-Host 2181
2015-09-26 15:25:19,874 WARN [main-SendThread(ZK-Host:2181)]
zookeeper.ClientCnxn: Can not get the principle name from server ZK-Host
2015-09-26 15:25:19,874 INFO [main-SendThread(ZK-Host:2181)]
zookeeper.ClientCnxn: Opening socket connection to server ZK-Host/ZK-IP:2181.
Will not attempt to authenticate using SASL (unknown error)
{code}
> Since HM1 could not connect to any ZooKeeper server, it was never notified of
> the session expiry, so HM1 did not abort.
> When HM1's session timed out on the ZooKeeper side, the standby master (HM2)
> registered itself as the active master.
> HM2 keeps waiting for region servers to report to it as part of active
> master initialization.
{noformat}
2015-09-26 15:24:44,928 | INFO | HM2-Host:21300.activeMasterManager | Waiting
for region servers count to settle; currently checked in 0, slept for 0 ms,
expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of
1500 ms. |
org.apache.hadoop.hbase.master.ServerManager.waitForRegionServers(ServerManager.java:1011)
---
---
2015-09-26 15:32:50,841 | INFO | HM2-Host:21300.activeMasterManager | Waiting
for region servers count to settle; currently checked in 0, slept for 483913
ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval
of 1500 ms. |
org.apache.hadoop.hbase.master.ServerManager.waitForRegionServers(ServerManager.java:1011)
{noformat}
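The log above shows HM2 still waiting after 483913 ms even though the timeout is
4500 ms, which suggests that the timeout is not honored while fewer than the
minimum number of region servers have checked in. A minimal sketch of such a
wait condition follows. It is a simplified model inferred from the fields in
the log line, not the HBase source; the class, method, and parameter names are
hypothetical.

```java
/** Hypothetical model of the wait condition implied by the log line; not HBase code. */
final class SettleWaitModel {
    private SettleWaitModel() {}

    /**
     * Returns true while the master would keep waiting for region servers.
     * All parameters mirror values printed in the log message.
     */
    static boolean keepWaiting(int checkedIn, int minToStart, int maxToStart,
                               long sleptMs, long timeoutMs,
                               long sinceLastChangeMs, long intervalMs) {
        if (checkedIn >= maxToStart) {
            return false; // enough servers: stop waiting immediately
        }
        return checkedIn < minToStart          // minimum not reached: timeout is ignored
            || sleptMs < timeoutMs             // overall timeout not yet elapsed
            || sinceLastChangeMs < intervalMs; // count changed too recently to be settled
    }
}
```

Under this model, a master with zero region servers checked in and a minimum of
1 never leaves the wait by itself, which matches the behavior seen on HM2.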
> Meanwhile, the region servers keep reporting to HM1 at a 3-second interval. A
> region server retrieves the master location from ZooKeeper only when it cannot
> connect to the master (ServiceException).
By the current design, the region servers will not report to HM2 until HM1
aborts, so HM2 will exit (InitializationMonitor) and then wait for region
servers again, in a loop.
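
The region server behavior described above can be illustrated with a small
model. This is only a sketch under stated assumptions: ReportLoopModel, its
fields, and reportOnce are hypothetical names, the ZooKeeper lookup and the
report RPC are replaced by plain Java functions, and none of it is taken from
the HBase code base.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Hypothetical model of a region server's report loop; not HBase code. */
final class ReportLoopModel {
    // Stands in for reading the active master's address from ZooKeeper.
    private final Supplier<String> zkMasterLookup;
    // Stands in for the report RPC succeeding against the given master.
    private final Predicate<String> masterReachable;
    // Master address the region server currently reports to.
    private String cachedMaster;

    ReportLoopModel(String initialMaster,
                    Supplier<String> zkMasterLookup,
                    Predicate<String> masterReachable) {
        this.cachedMaster = initialMaster;
        this.zkMasterLookup = zkMasterLookup;
        this.masterReachable = masterReachable;
    }

    /** One report attempt; returns the master reported to, or null on failure. */
    String reportOnce() {
        if (masterReachable.test(cachedMaster)) {
            // Report succeeded, so ZooKeeper is never consulted.
            return cachedMaster;
        }
        // Report failed (the ServiceException case): only now refresh from ZooKeeper.
        cachedMaster = zkMasterLookup.get();
        return null;
    }
}
```

With a lookup that returns "HM2" and a predicate that always succeeds, every
call to reportOnce() keeps returning "HM1": as long as HM1 stays reachable, the
model never learns about HM2, which is the situation reported here.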
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)