[
https://issues.apache.org/jira/browse/MESOS-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194888#comment-14194888
]
Benjamin Mahler commented on MESOS-2014:
----------------------------------------
Hi [~jesson], you need to keep a quorum of masters online for a master to
successfully recover. Typically this means running the master under something
(like Monit) that ensures that a downed master process will be restarted
promptly, on the order of seconds. Are you doing that?
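For reference, a minimal sketch of a Monit stanza for this might look like the
following (the pidfile and init script paths are assumptions and depend on how
Mesos is packaged on your hosts):

    # Supervise the mesos-master process; Monit restarts it automatically
    # whenever the process behind the pidfile is not running.
    # NOTE: paths below are assumptions -- adjust to your installation.
    check process mesos-master with pidfile /var/run/mesos/mesos-master.pid
      start program = "/etc/init.d/mesos-master start"
      stop program  = "/etc/init.d/mesos-master stop"

Keep Monit's poll interval short (e.g. "set daemon 10" in monitrc) so a crashed
master comes back within seconds rather than minutes. With 3 masters started
with --quorum=2, registrar recovery can only succeed while at least 2 of them
are up and able to reach each other.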
> error of Recovery failed: Failed to recover registrar: Failed to perform
> fetch within 5mins
> -------------------------------------------------------------------------------------------
>
> Key: MESOS-2014
> URL: https://issues.apache.org/jira/browse/MESOS-2014
> Project: Mesos
> Issue Type: Bug
> Components: master
> Affects Versions: 0.20.1
> Environment: CentOS 6.3
> 3.10.5-12.1.x86_64 #1 SMP Fri Aug 16 01:42:38 UTC 2013 x86_64 x86_64 x86_64
> GNU/Linux
> Reporter: Ji Huang
>
> I set up a Mesos master cluster with 3 nodes. At first everything goes
> well, but when the leader master dies, the other candidate nodes cannot
> recover and elect a new leader; all of the candidate nodes die too.
> I1030 15:01:32.005691 6741 detector.cpp:138] Detected a new leader: (id='16')
> I1030 15:01:32.005692 6737 network.hpp:423] ZooKeeper group memberships
> changed
> I1030 15:01:32.006089 6741 group.cpp:658] Trying to get
> '/mesos/info_0000000016' in ZooKeeper
> I1030 15:01:32.006222 6738 group.cpp:658] Trying to get
> '/mesos/log_replicas/0000000015' in ZooKeeper
> I1030 15:01:32.007230 6738 group.cpp:658] Trying to get
> '/mesos/log_replicas/0000000016' in ZooKeeper
> I1030 15:01:32.007268 6736 detector.cpp:426] A new leading master
> ([email protected]:5050) is detected
> I1030 15:01:32.007546 6742 master.cpp:1196] The newly elected leader is
> [email protected]:5050 with id 20141030-150042-94987018-5050-6735
> I1030 15:01:32.007640 6742 master.cpp:1209] Elected as the leading master!
> I1030 15:01:32.007730 6742 master.cpp:1027] Recovering from registrar
> I1030 15:01:32.007895 6736 registrar.cpp:313] Recovering registrar
> I1030 15:01:32.008388 6742 network.hpp:461] ZooKeeper group PIDs: {
> log-replica(1)@10.99.169.5:5050, log-replica(1)@10.99.169.6:5050 }
> I1030 15:01:32.051316 6742 replica.cpp:638] Replica in EMPTY status received
> a broadcasted recover request
> I1030 15:01:32.889194 6738 replica.cpp:638] Replica in EMPTY status received
> a broadcasted recover request
> I1030 15:01:33.469511 6743 replica.cpp:638] Replica in EMPTY status received
> a broadcasted recover request
> I1030 15:01:34.324684 6740 replica.cpp:638] Replica in EMPTY status received
> a broadcasted recover request
> I1030 15:01:35.263629 6736 replica.cpp:638] Replica in EMPTY status received
> a broadcasted recover request
> I1030 15:01:36.212492 6739 replica.cpp:638] Replica in EMPTY status received
> a broadcasted recover request
> I1030 15:01:37.015682 6742 replica.cpp:638] Replica in EMPTY status received
> a broadcasted recover request
> I1030 15:01:37.781746 6743 replica.cpp:638] Replica in EMPTY status received
> a broadcasted recover request
> I1030 15:01:38.494547 6737 replica.cpp:638] Replica in EMPTY status received
> a broadcasted recover request
> I1030 15:01:39.186830 6740 replica.cpp:638] Replica in EMPTY status received
> a broadcasted recover request
> I1030 15:01:40.072258 6736 replica.cpp:638] Replica in EMPTY status received
> a broadcasted recover request
> I1030 15:01:40.855337 6743 replica.cpp:638] Replica in EMPTY status received
> a broadcasted recover request
> I1030 15:01:41.516916 6739 replica.cpp:638] Replica in EMPTY status received
> a broadcasted recover request
> I1030 15:01:41.556437 6744 recover.cpp:111] Unable to finish the recover
> protocol in 10secs, retrying
> I1030 15:01:41.557253 6741 replica.cpp:638] Replica in EMPTY status received
> a broadcasted recover request
> I1030 15:01:41.557502 6739 recover.cpp:188] Received a recover response from
> a replica in EMPTY status
> I1030 15:01:41.558156 6741 recover.cpp:188] Received a recover response from
> a replica in EMPTY status
> I1030 15:01:42.153370 6737 replica.cpp:638] Replica in EMPTY status received
> a broadcasted recover request
> I1030 15:01:42.505698 6742 replica.cpp:638] Replica in EMPTY status received
> a broadcasted recover request
> I1030 15:01:42.506060 6738 recover.cpp:188] Received a recover response from
> a replica in EMPTY status
> I1030 15:01:42.507046 6742 recover.cpp:188] Received a recover response from
> a replica in EMPTY status
> ......
> F1030 15:06:32.009464 6741 master.cpp:1016] Recovery failed: Failed to
> recover registrar: Failed to perform fetch within 5mins
> Core dump info:
> #0 0x0000003d636328a5 in raise () from /lib64/libc.so.6
> #1 0x0000003d63634085 in abort () from /lib64/libc.so.6
> #2 0x00007f7a452f0e19 in google::DumpStackTraceAndExit () at
> src/utilities.cc:147
> #3 0x00007f7a452e7d5d in google::LogMessage::Fail () at src/logging.cc:1458
> #4 0x00007f7a452ebd77 in google::LogMessage::SendToLog (this=0x7f7a41d8f9d0)
> at src/logging.cc:1412
> #5 0x00007f7a452e9bf9 in google::LogMessage::Flush (this=0x7f7a41d8f9d0) at
> src/logging.cc:1281
> #6 0x00007f7a452e9efd in google::LogMessageFatal::~LogMessageFatal
> (this=0x7f7a41d8f9d0, __in_chrg=<value optimized out>) at src/logging.cc:1984
> #7 0x00007f7a44d6759c in mesos::internal::master::fail (message="Recovery
> failed", failure="Failed to recover registrar: Failed to perform fetch within
> 5mins") at ../../src/master/master.cpp:1016
> #8 0x00007f7a44da75a6 in __call<std::basic_string<char,
> std::char_traits<char>, std::allocator<char> > const&, 0, 1>
> (__functor=<value optimized out>, __args#0=
> "Failed to recover registrar: Failed to perform fetch within 5mins") at
> /usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/tr1_impl/functional:1137
> #9 operator()<const std::basic_string<char, std::char_traits<char>,
> std::allocator<char> > > (__functor=<value optimized out>, __args#0="Failed
> to recover registrar: Failed to perform fetch within 5mins")
> at
> /usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/tr1_impl/functional:1191
> #10 std::tr1::_Function_handler<void(const std::string&),
> std::tr1::_Bind<void (*(const char*, std::tr1::_Placeholder<1>))(const
> std::string&, const std::string&)> >::_M_invoke(const std::tr1::_Any_data &,
> const std::string &) (__functor=<value optimized out>, __args#0="Failed to
> recover registrar: Failed to perform fetch within 5mins")
> at
> /usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/tr1_impl/functional:1668
> #11 0x00007f7a44caff3c in process::Future<Nothing>::fail
> (this=0x7f7a140164f8, _message=<value optimized out>) at
> ../../3rdparty/libprocess/include/process/future.hpp:1628
> #12 0x00007f7a44de1a6a in fail (promise=std::tr1::shared_ptr (count 1)
> 0x7f7a140164f0, f=..., future=<value optimized out>) at
> ../../3rdparty/libprocess/include/process/future.hpp:789
> #13 process::internal::thenf<mesos::internal::Registry, Nothing>(const
> std::tr1::shared_ptr<process::Promise<Nothing> > &, const
> std::tr1::function<process::Future<Nothing>(const
> mesos::internal::Registry&)> &, const
> process::Future<mesos::internal::Registry> &) (promise=std::tr1::shared_ptr
> (count 1) 0x7f7a140164f0, f=..., future=<value optimized out>) at
> ../../3rdparty/libprocess/include/process/future.hpp:1438
> #14 0x00007f7a44e18ffc in process::Future<mesos::internal::Registry>::fail
> (this=0x7f7a2800be68, _message=<value optimized out>) at
> ../../3rdparty/libprocess/include/process/future.hpp:1634
> #15 0x00007f7a44e18f9c in process::Future<mesos::internal::Registry>::fail
> (this=0x7f7a2801c488, _message=<value optimized out>) at
> ../../3rdparty/libprocess/include/process/future.hpp:1628
> #16 0x00007f7a44e0cf4c in fail (this=0x2179b80, info=<value optimized out>,
> recovery=<value optimized out>) at
> ../../3rdparty/libprocess/include/process/future.hpp:789
> #17 mesos::internal::master::RegistrarProcess::_recover (this=0x2179b80,
> info=<value optimized out>, recovery=<value optimized out>) at
> ../../src/master/registrar.cpp:341
> #18 0x00007f7a44e24181 in __call<process::ProcessBase*&, 0, 1>
> (__functor=<value optimized out>, __args#0=<value optimized out>)
> at
> /usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/tr1_impl/functional:1137
> #19 operator()<process::ProcessBase*> (__functor=<value optimized out>,
> __args#0=<value optimized out>) at
> /usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/tr1_impl/functional:1191
> #20 std::tr1::_Function_handler<void(process::ProcessBase*),
> std::tr1::_Bind<void (*(std::tr1::_Placeholder<1>,
> std::tr1::shared_ptr<std::tr1::function<void(mesos::internal::master::RegistrarProcess*)>
> >))(process::ProcessBase*,
> std::tr1::shared_ptr<std::tr1::function<void(mesos::internal::master::RegistrarProcess*)>
> >)> >::_M_invoke(const std::tr1::_Any_data &, process::ProcessBase *)
> (__functor=<value optimized out>,
> __args#0=<value optimized out>) at
> /usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/tr1_impl/functional:1668
> #21 0x00007f7a452814f4 in process::ProcessManager::resume (this=0x214b690,
> process=0x2179e28) at ../../../3rdparty/libprocess/src/process.cpp:2848
> #22 0x00007f7a45281dec in process::schedule (arg=<value optimized out>) at
> ../../../3rdparty/libprocess/src/process.cpp:1479
> #23 0x0000003d63a07851 in start_thread () from /lib64/libpthread.so.0
> #24 0x0000003d636e811d in clone () from /lib64/libc.so.6