[ https://issues.apache.org/jira/browse/MESOS-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194888#comment-14194888 ]

Benjamin Mahler commented on MESOS-2014:
----------------------------------------

Hi [~jesson], you need to keep a quorum of masters online for a master to 
successfully recover. Typically this means running the master under something 
(like Monit) that ensures that a downed master process will be restarted 
promptly, on the order of seconds. Are you doing that?
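
For reference, a minimal Monit sketch of that pattern could look like the 
following. The pidfile and service paths here are illustrative assumptions 
rather than anything from this report, so adjust them to match your 
installation:

  # Hypothetical paths; point these at your real pidfile and init script.
  check process mesos-master with pidfile /var/run/mesos/mesos-master.pid
    start program = "/usr/sbin/service mesos-master start"
    stop program  = "/usr/sbin/service mesos-master stop"

With 3 masters the registry's replicated log needs a quorum of 2 live 
replicas, so a downed master must be restarted before a second one fails; 
otherwise recovery blocks, which matches the repeated "broadcasted recover 
request" lines in the log quoted below.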

> error of Recovery failed: Failed to recover registrar: Failed to perform 
> fetch within 5mins
> -------------------------------------------------------------------------------------------
>
>                 Key: MESOS-2014
>                 URL: https://issues.apache.org/jira/browse/MESOS-2014
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.20.1
>         Environment: CentOS 6.3
>  3.10.5-12.1.x86_64 #1 SMP Fri Aug 16 01:42:38 UTC 2013 x86_64 x86_64 x86_64 
> GNU/Linux
>            Reporter: Ji Huang
>
> I set up a Mesos master cluster with 3 nodes. At first everything went 
> well, but when the leader master died, the other candidate nodes could not 
> recover and elect a new leader; all of the candidate nodes died too.
> I1030 15:01:32.005691  6741 detector.cpp:138] Detected a new leader: (id='16')
> I1030 15:01:32.005692  6737 network.hpp:423] ZooKeeper group memberships 
> changed
> I1030 15:01:32.006089  6741 group.cpp:658] Trying to get 
> '/mesos/info_0000000016' in ZooKeeper
> I1030 15:01:32.006222  6738 group.cpp:658] Trying to get 
> '/mesos/log_replicas/0000000015' in ZooKeeper
> I1030 15:01:32.007230  6738 group.cpp:658] Trying to get 
> '/mesos/log_replicas/0000000016' in ZooKeeper
> I1030 15:01:32.007268  6736 detector.cpp:426] A new leading master 
> ([email protected]:5050) is detected
> I1030 15:01:32.007546  6742 master.cpp:1196] The newly elected leader is 
> [email protected]:5050 with id 20141030-150042-94987018-5050-6735
> I1030 15:01:32.007640  6742 master.cpp:1209] Elected as the leading master!
> I1030 15:01:32.007730  6742 master.cpp:1027] Recovering from registrar
> I1030 15:01:32.007895  6736 registrar.cpp:313] Recovering registrar
> I1030 15:01:32.008388  6742 network.hpp:461] ZooKeeper group PIDs: { 
> log-replica(1)@10.99.169.5:5050, log-replica(1)@10.99.169.6:5050 }
> I1030 15:01:32.051316  6742 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I1030 15:01:32.889194  6738 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I1030 15:01:33.469511  6743 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I1030 15:01:34.324684  6740 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I1030 15:01:35.263629  6736 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I1030 15:01:36.212492  6739 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I1030 15:01:37.015682  6742 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I1030 15:01:37.781746  6743 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I1030 15:01:38.494547  6737 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I1030 15:01:39.186830  6740 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I1030 15:01:40.072258  6736 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I1030 15:01:40.855337  6743 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I1030 15:01:41.516916  6739 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I1030 15:01:41.556437  6744 recover.cpp:111] Unable to finish the recover 
> protocol in 10secs, retrying
> I1030 15:01:41.557253  6741 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I1030 15:01:41.557502  6739 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I1030 15:01:41.558156  6741 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I1030 15:01:42.153370  6737 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I1030 15:01:42.505698  6742 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I1030 15:01:42.506060  6738 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I1030 15:01:42.507046  6742 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> ......
> F1030 15:06:32.009464  6741 master.cpp:1016] Recovery failed: Failed to 
> recover registrar: Failed to perform fetch within 5mins
> Core dump info:
> #0  0x0000003d636328a5 in raise () from /lib64/libc.so.6
> #1  0x0000003d63634085 in abort () from /lib64/libc.so.6
> #2  0x00007f7a452f0e19 in google::DumpStackTraceAndExit () at 
> src/utilities.cc:147
> #3  0x00007f7a452e7d5d in google::LogMessage::Fail () at src/logging.cc:1458
> #4  0x00007f7a452ebd77 in google::LogMessage::SendToLog (this=0x7f7a41d8f9d0) 
> at src/logging.cc:1412
> #5  0x00007f7a452e9bf9 in google::LogMessage::Flush (this=0x7f7a41d8f9d0) at 
> src/logging.cc:1281
> #6  0x00007f7a452e9efd in google::LogMessageFatal::~LogMessageFatal 
> (this=0x7f7a41d8f9d0, __in_chrg=<value optimized out>) at src/logging.cc:1984
> #7  0x00007f7a44d6759c in mesos::internal::master::fail (message="Recovery 
> failed", failure="Failed to recover registrar: Failed to perform fetch within 
> 5mins") at ../../src/master/master.cpp:1016
> #8  0x00007f7a44da75a6 in __call<std::basic_string<char, 
> std::char_traits<char>, std::allocator<char> > const&, 0, 1> 
> (__functor=<value optimized out>, __args#0=
>     "Failed to recover registrar: Failed to perform fetch within 5mins") at 
> /usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/tr1_impl/functional:1137
> #9  operator()<const std::basic_string<char, std::char_traits<char>, 
> std::allocator<char> > > (__functor=<value optimized out>, __args#0="Failed 
> to recover registrar: Failed to perform fetch within 5mins")
>     at 
> /usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/tr1_impl/functional:1191
> #10 std::tr1::_Function_handler<void(const std::string&), 
> std::tr1::_Bind<void (*(const char*, std::tr1::_Placeholder<1>))(const 
> std::string&, const std::string&)> >::_M_invoke(const std::tr1::_Any_data &, 
> const std::string &) (__functor=<value optimized out>, __args#0="Failed to 
> recover registrar: Failed to perform fetch within 5mins")
>     at 
> /usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/tr1_impl/functional:1668
> #11 0x00007f7a44caff3c in process::Future<Nothing>::fail 
> (this=0x7f7a140164f8, _message=<value optimized out>) at 
> ../../3rdparty/libprocess/include/process/future.hpp:1628
> #12 0x00007f7a44de1a6a in fail (promise=std::tr1::shared_ptr (count 1) 
> 0x7f7a140164f0, f=..., future=<value optimized out>) at 
> ../../3rdparty/libprocess/include/process/future.hpp:789
> #13 process::internal::thenf<mesos::internal::Registry, Nothing>(const 
> std::tr1::shared_ptr<process::Promise<Nothing> > &, const 
> std::tr1::function<process::Future<Nothing>(const 
> mesos::internal::Registry&)> &, const 
> process::Future<mesos::internal::Registry> &) (promise=std::tr1::shared_ptr 
> (count 1) 0x7f7a140164f0, f=..., future=<value optimized out>) at 
> ../../3rdparty/libprocess/include/process/future.hpp:1438
> #14 0x00007f7a44e18ffc in process::Future<mesos::internal::Registry>::fail 
> (this=0x7f7a2800be68, _message=<value optimized out>) at 
> ../../3rdparty/libprocess/include/process/future.hpp:1634
> #15 0x00007f7a44e18f9c in process::Future<mesos::internal::Registry>::fail 
> (this=0x7f7a2801c488, _message=<value optimized out>) at 
> ../../3rdparty/libprocess/include/process/future.hpp:1628
> #16 0x00007f7a44e0cf4c in fail (this=0x2179b80, info=<value optimized out>, 
> recovery=<value optimized out>) at 
> ../../3rdparty/libprocess/include/process/future.hpp:789
> #17 mesos::internal::master::RegistrarProcess::_recover (this=0x2179b80, 
> info=<value optimized out>, recovery=<value optimized out>) at 
> ../../src/master/registrar.cpp:341
> #18 0x00007f7a44e24181 in __call<process::ProcessBase*&, 0, 1> 
> (__functor=<value optimized out>, __args#0=<value optimized out>)
>     at 
> /usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/tr1_impl/functional:1137
> #19 operator()<process::ProcessBase*> (__functor=<value optimized out>, 
> __args#0=<value optimized out>) at 
> /usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/tr1_impl/functional:1191
> #20 std::tr1::_Function_handler<void(process::ProcessBase*), 
> std::tr1::_Bind<void (*(std::tr1::_Placeholder<1>, 
> std::tr1::shared_ptr<std::tr1::function<void(mesos::internal::master::RegistrarProcess*)>
>  >))(process::ProcessBase*, 
> std::tr1::shared_ptr<std::tr1::function<void(mesos::internal::master::RegistrarProcess*)>
>  >)> >::_M_invoke(const std::tr1::_Any_data &, process::ProcessBase *) 
> (__functor=<value optimized out>, 
>     __args#0=<value optimized out>) at 
> /usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/tr1_impl/functional:1668
> #21 0x00007f7a452814f4 in process::ProcessManager::resume (this=0x214b690, 
> process=0x2179e28) at ../../../3rdparty/libprocess/src/process.cpp:2848
> #22 0x00007f7a45281dec in process::schedule (arg=<value optimized out>) at 
> ../../../3rdparty/libprocess/src/process.cpp:1479
> #23 0x0000003d63a07851 in start_thread () from /lib64/libpthread.so.0
> #24 0x0000003d636e811d in clone () from /lib64/libc.so.6



