[ https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15237577#comment-15237577 ]
Priyanka Gupta commented on MESOS-5193:
---------------------------------------

Error Stack in mesos master log

Node 3

I0411 22:47:02.007249 1348 detector.cpp:479] A new leading master (UPID=master@10.221.28.61:5050) is detected
I0411 22:47:02.007380 1348 master.cpp:1710] The newly elected leader is master@10.221.28.61:5050 with id 725d1232-bea3-4df5-90c5-6479e5652ef4
I0411 22:47:02.007428 1348 master.cpp:1723] Elected as the leading master!
I0411 22:47:02.007457 1348 master.cpp:1468] Recovering from registrar
I0411 22:47:02.007551 1345 registrar.cpp:307] Recovering registrar
I0411 22:47:02.007649 1356 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@10.221.28.61:5050, log-replica(1)@10.221.28.249:5050 }
I0411 22:47:02.007841 1356 log.cpp:659] Attempting to start the writer
I0411 22:47:02.008477 1348 replica.cpp:493] Replica received implicit promise request from (30)@10.221.28.61:5050 with proposal 52
E0411 22:47:02.008903 1358 process.cpp:1966] Failed to shutdown socket with fd 23: Transport endpoint is not connected
I0411 22:47:02.009968 1348 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 1.44126ms
I0411 22:47:02.010022 1348 replica.cpp:342] Persisted promised to 52
F0411 22:48:02.008332 1357 master.cpp:1457] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
*** Check failure stack trace: ***
    @ 0x7f4bd5bcedfd (unknown)
    @ 0x7f4bd5bd0c3d (unknown)
    @ 0x7f4bd5bce9ec (unknown)
    @ 0x7f4bd5bd1539 (unknown)
    @ 0x7f4bd54022dc (unknown)
    @ 0x7f4bd5442ab0 (unknown)
    @ 0x42807e (unknown)
    @ 0x7f4bd54690a5 (unknown)
    @ 0x7f4bd54bb976 (unknown)
    @ 0x7f4bd54cc566 (unknown)
    @ 0x7f4bd52fc4d6 (unknown)
    @ 0x7f4bd54cc553 (unknown)
    @ 0x7f4bd54b0614 (unknown)
    @ 0x7f4bd5b7c971 (unknown)
    @ 0x7f4bd5b7cc77 (unknown)
    @ 0x3dc38b6470 (unknown)
    @ 0x3dc18079d1 (unknown)
    @ 0x3dc14e88fd (unknown)
    @ (nil) (unknown)
/bin/bash: line 1: 1313 Aborted /usr/sbin/mesos-master --work_dir=/tmp/mesos_dir --zk=zk://scheduler1.rpt.cb.ne1.yahoo.com:2181,scheduler2.rpt.cb.ne1.yahoo.com:2181,scheduler3.rpt.cb.ne1.yahoo.com:2181/mesos --quorum=2

Node 2

I0411 22:48:10.006216 1466 log.cpp:659] Attempting to start the writer
E0411 22:48:10.006958 1478 process.cpp:1966] Failed to shutdown socket with fd 23: Transport endpoint is not connected
I0411 22:48:10.007202 1467 replica.cpp:493] Replica received implicit promise request from (13)@10.221.28.249:5050 with proposal 52
E0411 22:48:10.007491 1478 process.cpp:1966] Failed to shutdown socket with fd 23: Transport endpoint is not connected
I0411 22:48:10.008458 1467 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 1.227092ms
I0411 22:48:10.008491 1467 replica.cpp:342] Persisted promised to 52
F0411 22:49:10.006739 1476 master.cpp:1457] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
*** Check failure stack trace: ***
    @ 0x7fec686f2dfd (unknown)
    @ 0x7fec686f4c3d (unknown)
    @ 0x7fec686f29ec (unknown)
    @ 0x7fec686f5539 (unknown)
    @ 0x7fec67f262dc (unknown)
    @ 0x7fec67f66ab0 (unknown)
    @ 0x42807e (unknown)
    @ 0x7fec67f8d0a5 (unknown)
    @ 0x7fec67fdf976 (unknown)
    @ 0x7fec67ff0566 (unknown)
    @ 0x7fec67e204d6 (unknown)
    @ 0x7fec67ff0553 (unknown)
    @ 0x7fec67fd4614 (unknown)
    @ 0x7fec686a0971 (unknown)
    @ 0x7fec686a0c77 (unknown)
    @ 0x37f98b6470 (unknown)
    @ 0x39ed207a51 (unknown)
    @ 0x39ecae89ad (unknown)
    @ (nil) (unknown)
/bin/bash: line 1: 1452 Aborted /usr/sbin/mesos-master --work_dir=/tmp/mesos_dir --zk=zk://scheduler1.rpt.cb.ne1.yahoo.com:2181,scheduler2.rpt.cb.ne1.yahoo.com:2181,scheduler3.rpt.cb.ne1.yahoo.com:2181/mesos --quorum=2

Node 1

I0411 22:45:52.017833 8338 detector.cpp:479] A new leading master (UPID=master@10.221.29.247:5050) is detected
I0411 22:45:52.017925 8338 master.cpp:1710] The newly elected leader is master@10.221.29.247:5050 with id 13df6437-fbe9-4390-9f6c-db9fd1d53a16
I0411 22:45:52.017956 8338 master.cpp:1723] Elected as the leading master!
I0411 22:45:52.017983 8338 master.cpp:1468] Recovering from registrar
I0411 22:45:52.018069 8339 registrar.cpp:307] Recovering registrar
I0411 22:45:52.018337 8333 log.cpp:659] Attempting to start the writer
I0411 22:45:52.018785 8336 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@10.221.28.61:5050, log-replica(1)@10.221.29.247:5050 }
I0411 22:45:52.019008 8336 replica.cpp:493] Replica received implicit promise request from (31)@10.221.29.247:5050 with proposal 50
E0411 22:45:52.019548 8341 process.cpp:1966] Failed to shutdown socket with fd 24: Transport endpoint is not connected
I0411 22:45:52.020465 8336 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 1.421142ms
I0411 22:45:52.020496 8336 replica.cpp:342] Persisted promised to 50
I0411 22:46:15.034744 8340 network.hpp:413] ZooKeeper group memberships changed
I0411 22:46:15.034867 8334 group.cpp:672] Trying to get '/mesos/log_replicas/0000000346' in ZooKeeper
I0411 22:46:15.035729 8334 group.cpp:672] Trying to get '/mesos/log_replicas/0000000347' in ZooKeeper
I0411 22:46:15.036533 8334 group.cpp:672] Trying to get '/mesos/log_replicas/0000000348' in ZooKeeper
I0411 22:46:15.037353 8335 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@10.221.28.61:5050, log-replica(1)@10.221.28.249:5050, log-replica(1)@10.221.29.247:5050 }
I0411 22:46:27.242632 8336 http.cpp:503] HTTP GET for /master/state.json from 216.145.54.15:54890 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:45.0) Gecko/20100101 Firefox/45.0'
I0411 22:46:37.292083 8335 http.cpp:503] HTTP GET for /master/state.json from 216.145.54.15:54890 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:45.0) Gecko/20100101 Firefox/45.0'
I0411 22:46:47.342876 8334 http.cpp:503] HTTP GET for /master/state.json from 216.145.54.15:54890 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:45.0) Gecko/20100101 Firefox/45.0'
F0411 22:46:52.019045 8333 master.cpp:1457] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
*** Check failure stack trace: ***
    @ 0x7f7ad44badfd (unknown)
    @ 0x7f7ad44bcc3d (unknown)
    @ 0x7f7ad44ba9ec (unknown)
    @ 0x7f7ad44bd539 (unknown)
    @ 0x7f7ad3cee2dc (unknown)
    @ 0x7f7ad3d2eab0 (unknown)
    @ 0x42807e (unknown)
    @ 0x7f7ad3d550a5 (unknown)
    @ 0x7f7ad3da7976 (unknown)
    @ 0x7f7ad3db8566 (unknown)
    @ 0x7f7ad3be84d6 (unknown)
    @ 0x7f7ad3db8553 (unknown)
    @ 0x7f7ad3d9c614 (unknown)
    @ 0x7f7ad4468971 (unknown)
    @ 0x7f7ad4468c77 (unknown)
    @ 0x35282b6470 (unknown)
    @ 0x35262079d1 (unknown)
    @ 0x3525ee88fd (unknown)
    @ (nil) (unknown)
/bin/bash: line 1: 8332 Aborted /usr/sbin/mesos-master --work_dir=/tmp/mesos_dir --zk=zk://scheduler1.rpt.cb.ne1.yahoo.com:2181,scheduler2.rpt.cb.ne1.yahoo.com:2181,scheduler3.rpt.cb.ne1.yahoo.com:2181/mesos --quorum=2

> Recovery failed: Failed to recover registrar on reboot of mesos master
> ----------------------------------------------------------------------
>
>                 Key: MESOS-5193
>                 URL: https://issues.apache.org/jira/browse/MESOS-5193
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.22.0, 0.27.0
>            Reporter: Priyanka Gupta
>              Labels: master, mesosphere
>
> Hi all,
> We are running a 3-node cluster with a Mesos master, a Mesos slave, and
> ZooKeeper on every node, with Chronos on top of it. The problem is that when
> we reboot the leading Mesos master, the other nodes try to get elected as
> leader but fail with a registrar recovery error:
> "Recovery failed: Failed to recover registrar: Failed to perform fetch within
> 1mins"
> The next node then tries to become the leader but fails again with the same
> error. I am not sure what causes this. We are currently using Mesos 0.22 and
> also tried upgrading to Mesos 0.27, but the problem continues to happen.
> /usr/sbin/mesos-master --work_dir=/tmp/mesos_dir
> --zk=zk://node1:2181,node2:2181,node3:2181/mesos --quorum=2
> Can you please help us resolve this issue, as it's a production system.
> Thanks,
> Priyanka

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
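
For anyone comparing the three master logs above, the following is a minimal Python sketch (not part of Mesos) that pulls the "ZooKeeper group PIDs" entries and the fatal (F-severity) lines out of glog-formatted output, so the replica membership seen at recovery time can be lined up against the --quorum=2 setting. The function name `summarize` and the regular expressions are assumptions made for illustration; they are only checked against the line shapes quoted in this comment.

```python
import re

# Assumed glog prefix shape, based on the lines quoted above:
# severity letter, MMDD, HH:MM:SS.micros, thread id, "file:line]", message.
GLOG = re.compile(
    r"^(?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<src>[^\]\s]+:\d+)\] (?P<msg>.*)$"
)
# Matches the replica list printed by network.hpp in the logs above.
PIDS = re.compile(r"ZooKeeper group PIDs: \{(?P<pids>.*)\}")


def summarize(lines):
    """Return (replica_sets, fatals) found in an iterable of log lines.

    replica_sets: list of (time, [replica PID, ...]) tuples
    fatals:       list of (time, message) tuples for F-severity lines
    """
    replica_sets, fatals = [], []
    for line in lines:
        m = GLOG.match(line.strip())
        if not m:
            continue  # stack-trace frames and shell output are skipped
        p = PIDS.search(m["msg"])
        if p:
            pids = [x.strip() for x in p["pids"].split(",") if x.strip()]
            replica_sets.append((m["time"], pids))
        if m["sev"] == "F":
            fatals.append((m["time"], m["msg"]))
    return replica_sets, fatals


# Two lines copied from the Node 3 log above.
sample = [
    "I0411 22:47:02.007649 1356 network.hpp:461] ZooKeeper group PIDs: "
    "{ log-replica(1)@10.221.28.61:5050, log-replica(1)@10.221.28.249:5050 }",
    "F0411 22:48:02.008332 1357 master.cpp:1457] Recovery failed: Failed to "
    "recover registrar: Failed to perform fetch within 1mins",
]

replica_sets, fatals = summarize(sample)
for when, pids in replica_sets:
    print(when, "replicas visible:", len(pids))
for when, msg in fatals:
    print(when, "FATAL:", msg)
```

Running it over each node's full log would show how many log replicas were registered in ZooKeeper when each recovery attempt started; it does not by itself identify the cause of the timeout.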