[ 
https://issues.apache.org/jira/browse/MESOS-3532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14933347#comment-14933347
 ] 

Edward Donahue III commented on MESOS-3532:
-------------------------------------------


Here are the log entries when the master restarts:

mesos-master[23265]: F0928 14:12:54.010592 23268 master.cpp:1253] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
mesos-master[23265]: *** Check failure stack trace: ***
mesos-master[23265]: @     0x7f9e75b38bbd  google::LogMessage::Fail()
mesos-master[23265]: @     0x7f9e75b3a8fc  google::LogMessage::SendToLog()
mesos-master[23265]: @     0x7f9e75b387ac  google::LogMessage::Flush()
mesos-master[23265]: @     0x7f9e75b3b1f9  google::LogMessageFatal::~LogMessageFatal()
mesos-master[23265]: @     0x7f9e75566b3c  mesos::internal::master::fail()
mesos-master[23265]: @     0x7f9e75598b20  _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvS1_S1_EPKc...ny_dataS1_
mesos-master[23265]: @           0x426bfe  process::Future<>::fail()
mesos-master[23265]: @     0x7f9e755c5bc5  process::internal::thenf<>()
mesos-master[23265]: @     0x7f9e7560dad6  _ZN7process8internal3runISt8functionIFvRKNS_6FutureIN5mesos8internal8RegistryEEEEEJRS7_EEEvRK...E_EEDpOT0_
mesos-master[23265]: @     0x7f9e7561e0f2  process::Future<>::fail()
mesos-master[23265]: @     0x7f9e754c6436  process::internal::run<>()
mesos-master[23265]: @     0x7f9e7561e0df  process::Future<>::fail()
mesos-master[23265]: @     0x7f9e756028bc  mesos::internal::master::RegistrarProcess::_recover()
mesos-master[23265]: @     0x7f9e75aeafe9  process::ProcessManager::resume()
mesos-master[23265]: @     0x7f9e75aeb2df  process::schedule()
mesos-master[23265]: @     0x7f9e74b24df5  start_thread
mesos-master[23265]: @     0x7f9e740331ad  __clone

> 3 Master HA setup restarts every 3 minutes
> ------------------------------------------
>
>                 Key: MESOS-3532
>                 URL: https://issues.apache.org/jira/browse/MESOS-3532
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.23.0
>            Reporter: Edward Donahue III
>
> CentOS 7.1, 3-node cluster; each host runs a Mesos master, a Mesos slave,
> and ZooKeeper.
> After I pushed out a bad zoo.cfg (it added 2 extra ZooKeeper hosts that
> didn't exist), the elected master restarts about every three minutes, and
> this keeps happening. When I have just one of the three masters running, it
> also restarts every 3 minutes.
> I fixed the configs and deleted all the files under
> /var/log/zookeeper/version-2/ and /var/lib/zookeeper/version-2/. Is there
> another step I need to take? I feel like ZooKeeper is the issue (it is also
> where I lack knowledge). This cluster was stable for months until I pushed
> out the bad zoo.cfg.
> The master logs have this output every second:
> I0928 13:56:05.281518 28448 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request
> I0928 13:56:05.351608 28450 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request
> I0928 13:56:05.351794 28448 recover.cpp:195] Received a recover response from a replica in EMPTY status
> I0928 13:56:05.352700 28452 recover.cpp:195] Received a recover response from a replica in EMPTY status
> I0928 13:56:05.352963 28447 recover.cpp:195] Received a recover response from a replica in VOTING status
> The mesos-slaves don't even register in time:
> I0928 13:55:40.041491 28418 slave.cpp:3087] [email protected]:5050 exited
> W0928 13:55:40.041574 28418 slave.cpp:3090] Master disconnected! Waiting for a new master to be elected
> E0928 13:55:40.250059 28420 socket.hpp:107] Shutdown failed on fd=9: Transport endpoint is not connected [107]
> I0928 13:55:48.005607 28418 detector.cpp:138] Detected a new leader: (id='14')
> I0928 13:55:48.005836 28417 group.cpp:656] Trying to get '/mesos/info_0000000014' in ZooKeeper
> W0928 13:55:48.006597 28417 detector.cpp:444] Leading master [email protected]:5050 is using a Protobuf binary f...ESOS-2340)
> I0928 13:55:48.006652 28417 detector.cpp:481] A new leading master ([email protected]:5050) is detected
> I0928 13:55:48.006731 28417 slave.cpp:684] New master detected at [email protected]:5050
> I0928 13:55:48.006891 28417 slave.cpp:709] No credentials provided. Attempting to register without authentication
> I0928 13:55:48.006911 28417 slave.cpp:720] Detecting new master
> I0928 13:55:48.006940 28417 status_update_manager.cpp:176] Pausing sending status updates
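
For anyone triaging similar output, the recover.cpp lines quoted above can be
tallied by replica status to see how many replicas report EMPTY versus VOTING.
The following is only a minimal sketch, not part of Mesos: the function name
tally_recover_responses and the SAMPLE text are hypothetical, and the regular
expression assumes the message wording is exactly as it appears in the excerpt.

```python
import re
from collections import Counter

# Assumption: the message text matches the recover.cpp lines in the excerpt.
RESPONSE_RE = re.compile(
    r"Received a recover response from a replica in (\w+) status"
)


def tally_recover_responses(log_text):
    """Count recover responses by replica status in a block of log text."""
    # Drop mail-style "> " quote prefixes and join the lines, so a message
    # that was wrapped across two lines still matches the pattern.
    lines = [re.sub(r"^>\s?", "", line).strip() for line in log_text.splitlines()]
    return Counter(RESPONSE_RE.findall(" ".join(lines)))


# Hypothetical sample, copied from the quoted master log above.
SAMPLE = """\
> I0928 13:56:05.351794 28448 recover.cpp:195] Received a recover response from
> a replica in EMPTY status
> I0928 13:56:05.352700 28452 recover.cpp:195] Received a recover response from
> a replica in EMPTY status
> I0928 13:56:05.352963 28447 recover.cpp:195] Received a recover response from
> a replica in VOTING status
"""

if __name__ == "__main__":
    print(dict(tally_recover_responses(SAMPLE)))
```

Run against the full master log, a tally that stays dominated by EMPTY
responses would be consistent with the repeated recover requests shown above,
though how Mesos acts on those counts is not something this sketch models.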



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
