Edward Donahue III created MESOS-3532:
-----------------------------------------
Summary: 3 Master HA setup restarts every 3 minutes
Key: MESOS-3532
URL: https://issues.apache.org/jira/browse/MESOS-3532
Project: Mesos
Issue Type: Bug
Affects Versions: 0.23.0
Reporter: Edward Donahue III
CentOS 7.1, 3-node cluster; each host runs a Mesos master, a Mesos slave, and
ZooKeeper.
After I pushed out a bad zoo.cfg (it added 2 extra ZooKeeper hosts that didn't
exist), the elected master restarts about every three minutes, and this keeps
happening. Even when only one of the three masters is running, it restarts
every 3 minutes.
I fixed the configs and deleted all the files under
/var/log/zookeeper/version-2/ and /var/lib/zookeeper/version-2/. Is there
another step I need to take? I suspect ZooKeeper is the issue (it is also where
I lack knowledge). This cluster was stable for months until I pushed out the
bad zoo.cfg.
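To make the zoo.cfg mistake concrete, here is a minimal Python sketch that lists the server.N entries in a zoo.cfg and flags any host that is not known to exist. The host names (zk1.example.com and so on), the dataDir value, the sample config, and the helper names are all hypothetical and not taken from this cluster. The sketch also computes the majority quorum: ZooKeeper needs more than half of the configured servers, so listing 5 servers when only 3 exist raises the quorum from 2 to 3. Whether that is what triggered the restarts here is not established by this report.

```python
import re

# Matches zoo.cfg lines such as "server.1=zk1.example.com:2888:3888".
SERVER_LINE = re.compile(r"^\s*server\.(\d+)\s*=\s*([^:\s]+)")


def ensemble_hosts(zoo_cfg_text):
    """Return {server_id: host} for every server.N entry in the config text."""
    hosts = {}
    for line in zoo_cfg_text.splitlines():
        match = SERVER_LINE.match(line)
        if match:
            hosts[int(match.group(1))] = match.group(2)
    return hosts


def quorum_size(ensemble_count):
    """Majority quorum: strictly more than half of the configured servers."""
    return ensemble_count // 2 + 1


def unknown_hosts(zoo_cfg_text, real_hosts):
    """Return configured hosts that are not in the set of hosts known to exist."""
    configured = ensemble_hosts(zoo_cfg_text).values()
    return sorted(host for host in configured if host not in real_hosts)


# Hypothetical bad config: three real hosts plus two that do not exist.
BAD_CFG = """\
tickTime=2000
dataDir=/var/lib/zookeeper
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
server.4=zk4.example.com:2888:3888
server.5=zk5.example.com:2888:3888
"""

# Hypothetical set of hosts that actually run ZooKeeper.
REAL_HOSTS = {"zk1.example.com", "zk2.example.com", "zk3.example.com"}

if __name__ == "__main__":
    configured = ensemble_hosts(BAD_CFG)
    print("configured servers:", len(configured))
    print("quorum needed:", quorum_size(len(configured)))
    print("hosts that do not exist:", unknown_hosts(BAD_CFG, REAL_HOSTS))
```

Running a check like this against the zoo.cfg on every node, before and after the fix, would show whether any host still carries the extra entries.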
The master logs have this output every second:
I0928 13:56:05.281518 28448 replica.cpp:641] Replica in EMPTY status received a
broadcasted recover request
I0928 13:56:05.351608 28450 replica.cpp:641] Replica in EMPTY status received a
broadcasted recover request
I0928 13:56:05.351794 28448 recover.cpp:195] Received a recover response from a
replica in EMPTY status
I0928 13:56:05.352700 28452 recover.cpp:195] Received a recover response from a
replica in EMPTY status
I0928 13:56:05.352963 28447 recover.cpp:195] Received a recover response from a
replica in VOTING status
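The status mix in those recover responses can be tallied with a short Python sketch. It only counts the status word in "Received a recover response" messages and tolerates the line wrapping seen in this email; the function name and the embedded sample are illustrative, and the sketch makes no claim about how Mesos acts on the counts.

```python
import re
from collections import Counter

# \s+ lets the pattern span a line break introduced by email wrapping.
RESPONSE = re.compile(
    r"Received a recover response from a\s+replica in (\w+) status"
)


def recover_response_statuses(log_text):
    """Count replica statuses reported in recover responses."""
    return Counter(RESPONSE.findall(log_text))


# The three recover responses quoted in this report, wrapped as in the email.
SAMPLE = """\
I0928 13:56:05.351794 28448 recover.cpp:195] Received a recover response from a
replica in EMPTY status
I0928 13:56:05.352700 28452 recover.cpp:195] Received a recover response from a
replica in EMPTY status
I0928 13:56:05.352963 28447 recover.cpp:195] Received a recover response from a
replica in VOTING status
"""

if __name__ == "__main__":
    for status, count in sorted(recover_response_statuses(SAMPLE).items()):
        print(status, count)
```

For the excerpt above this yields two EMPTY responses and one VOTING response.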
The mesos-slaves don't even register in time:
I0928 13:55:40.041491 28418 slave.cpp:3087] [email protected]:5050 exited
W0928 13:55:40.041574 28418 slave.cpp:3090] Master disconnected! Waiting for a
new master to be elected
E0928 13:55:40.250059 28420 socket.hpp:107] Shutdown failed on fd=9: Transport
endpoint is not connected [107]
I0928 13:55:48.005607 28418 detector.cpp:138] Detected a new leader: (id='14')
I0928 13:55:48.005836 28417 group.cpp:656] Trying to get
'/mesos/info_0000000014' in ZooKeeper
W0928 13:55:48.006597 28417 detector.cpp:444] Leading master
[email protected]:5050 is using a Protobuf binary f...ESOS-2340)
I0928 13:55:48.006652 28417 detector.cpp:481] A new leading master
([email protected]:5050) is detected
I0928 13:55:48.006731 28417 slave.cpp:684] New master detected at
[email protected]:5050
I0928 13:55:48.006891 28417 slave.cpp:709] No credentials provided. Attempting
to register without authentication
I0928 13:55:48.006911 28417 slave.cpp:720] Detecting new master
I0928 13:55:48.006940 28417 status_update_manager.cpp:176] Pausing sending
status updates
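To quantify how long a slave is without a master, the gap between "Master disconnected!" and the next "New master detected at" message can be measured from the glog timestamps. The following Python sketch assumes the standard glog prefix (severity letter, MMDD, time with microseconds, thread id, file:line) and that both events fall in the same year; the function name is illustrative.

```python
import re
from datetime import datetime

# glog prefix: severity letter, MMDD, HH:MM:SS.micros, thread id, "file:line]".
GLOG_LINE = re.compile(
    r"^[IWEF](\d{4} \d{2}:\d{2}:\d{2}\.\d{6})\s+\d+\s+\S+\]\s+(.*)$"
)


def failover_gaps(log_text):
    """Return seconds between each master disconnect and the next detection."""
    gaps = []
    lost_at = None
    for line in log_text.splitlines():
        match = GLOG_LINE.match(line)
        if not match:
            continue  # continuation of a wrapped log line
        # glog omits the year, so strptime fills in a default; only the
        # difference between two stamps is used.
        stamp = datetime.strptime(match.group(1), "%m%d %H:%M:%S.%f")
        message = match.group(2)
        if message.startswith("Master disconnected!"):
            lost_at = stamp
        elif message.startswith("New master detected at") and lost_at is not None:
            gaps.append((stamp - lost_at).total_seconds())
            lost_at = None
    return gaps


# Two of the slave log lines quoted in this report, wrapped as in the email.
SAMPLE = """\
W0928 13:55:40.041574 28418 slave.cpp:3090] Master disconnected! Waiting for a
new master to be elected
I0928 13:55:48.006731 28417 slave.cpp:684] New master detected at
"""

if __name__ == "__main__":
    print(failover_gaps(SAMPLE))
```

On the excerpt above the gap is roughly 8 seconds, which can be compared against the roughly 3-minute restart cycle described in this report.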
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)