[jira] [Commented] (MESOS-5193) Recovery failed: Failed to recover registrar on reboot of mesos master

Priyanka Gupta (JIRA) Tue, 12 Apr 2016 10:23:45 -0700

    [ 
https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15237577#comment-15237577
 ]


Priyanka Gupta commented on MESOS-5193:
---------------------------------------

Error Stack in mesos master log

Node 3
I0411 22:47:02.007249  1348 detector.cpp:479] A new leading master 
([email protected]:5050) is detected
I0411 22:47:02.007380  1348 master.cpp:1710] The newly elected leader is 
[email protected]:5050 with id 725d1232-bea3-4df5-90c5-6479e5652ef4
I0411 22:47:02.007428  1348 master.cpp:1723] Elected as the leading master!
I0411 22:47:02.007457  1348 master.cpp:1468] Recovering from registrar
I0411 22:47:02.007551  1345 registrar.cpp:307] Recovering registrar
I0411 22:47:02.007649  1356 network.hpp:461] ZooKeeper group PIDs: { 
log-replica(1)@10.221.28.61:5050, log-replica(1)@10.221.28.249:5050 }
I0411 22:47:02.007841  1356 log.cpp:659] Attempting to start the writer
I0411 22:47:02.008477  1348 replica.cpp:493] Replica received implicit promise 
request from (30)@10.221.28.61:5050 with proposal 52
E0411 22:47:02.008903  1358 process.cpp:1966] Failed to shutdown socket with fd 
23: Transport endpoint is not connected
I0411 22:47:02.009968  1348 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 1.44126ms
I0411 22:47:02.010022  1348 replica.cpp:342] Persisted promised to 52
F0411 22:48:02.008332  1357 master.cpp:1457] Recovery failed: Failed to recover 
registrar: Failed to perform fetch within 1mins
*** Check failure stack trace: ***
    @     0x7f4bd5bcedfd  (unknown)
    @     0x7f4bd5bd0c3d  (unknown)
    @     0x7f4bd5bce9ec  (unknown)
    @     0x7f4bd5bd1539  (unknown)
    @     0x7f4bd54022dc  (unknown)
    @     0x7f4bd5442ab0  (unknown)
    @           0x42807e  (unknown)
    @     0x7f4bd54690a5  (unknown)
    @     0x7f4bd54bb976  (unknown)
    @     0x7f4bd54cc566  (unknown)
    @     0x7f4bd52fc4d6  (unknown)
    @     0x7f4bd54cc553  (unknown)
    @     0x7f4bd54b0614  (unknown)
    @     0x7f4bd5b7c971  (unknown)
    @     0x7f4bd5b7cc77  (unknown)
    @       0x3dc38b6470  (unknown)
    @       0x3dc18079d1  (unknown)
    @       0x3dc14e88fd  (unknown)
    @              (nil)  (unknown)
/bin/bash: line 1:  1313 Aborted                 /usr/sbin/mesos-master 
--work_dir=/tmp/mesos_dir 
--zk=zk://scheduler1.rpt.cb.ne1.yahoo.com:2181,scheduler2.rpt.cb.ne1.yahoo.com:2181,scheduler3.rpt.cb.ne1.yahoo.com:2181/mesos
 --quorum=2



Node 2

I0411 22:48:10.006216  1466 log.cpp:659] Attempting to start the writer
E0411 22:48:10.006958  1478 process.cpp:1966] Failed to shutdown socket with fd 
23: Transport endpoint is not connected
I0411 22:48:10.007202  1467 replica.cpp:493] Replica received implicit promise 
request from (13)@10.221.28.249:5050 with proposal 52
E0411 22:48:10.007491  1478 process.cpp:1966] Failed to shutdown socket with fd 
23: Transport endpoint is not connected
I0411 22:48:10.008458  1467 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 1.227092ms
I0411 22:48:10.008491  1467 replica.cpp:342] Persisted promised to 52
F0411 22:49:10.006739  1476 master.cpp:1457] Recovery failed: Failed to recover 
registrar: Failed to perform fetch within 1mins
*** Check failure stack trace: ***
    @     0x7fec686f2dfd  (unknown)
    @     0x7fec686f4c3d  (unknown)
    @     0x7fec686f29ec  (unknown)
    @     0x7fec686f5539  (unknown)
    @     0x7fec67f262dc  (unknown)
    @     0x7fec67f66ab0  (unknown)
    @           0x42807e  (unknown)
    @     0x7fec67f8d0a5  (unknown)
    @     0x7fec67fdf976  (unknown)
    @     0x7fec67ff0566  (unknown)
    @     0x7fec67e204d6  (unknown)
    @     0x7fec67ff0553  (unknown)
    @     0x7fec67fd4614  (unknown)
    @     0x7fec686a0971  (unknown)
    @     0x7fec686a0c77  (unknown)
    @       0x37f98b6470  (unknown)
    @       0x39ed207a51  (unknown)
    @       0x39ecae89ad  (unknown)
    @              (nil)  (unknown)
/bin/bash: line 1:  1452 Aborted                 /usr/sbin/mesos-master 
--work_dir=/tmp/mesos_dir 
--zk=zk://scheduler1.rpt.cb.ne1.yahoo.com:2181,scheduler2.rpt.cb.ne1.yahoo.com:2181,scheduler3.rpt.cb.ne1.yahoo.com:2181/mesos
 --quorum=2



Node 1
I0411 22:45:52.017833  8338 detector.cpp:479] A new leading master 
([email protected]:5050) is detected
I0411 22:45:52.017925  8338 master.cpp:1710] The newly elected leader is 
[email protected]:5050 with id 13df6437-fbe9-4390-9f6c-db9fd1d53a16
I0411 22:45:52.017956  8338 master.cpp:1723] Elected as the leading master!
I0411 22:45:52.017983  8338 master.cpp:1468] Recovering from registrar
I0411 22:45:52.018069  8339 registrar.cpp:307] Recovering registrar
I0411 22:45:52.018337  8333 log.cpp:659] Attempting to start the writer
I0411 22:45:52.018785  8336 network.hpp:461] ZooKeeper group PIDs: { 
log-replica(1)@10.221.28.61:5050, log-replica(1)@10.221.29.247:5050 }
I0411 22:45:52.019008  8336 replica.cpp:493] Replica received implicit promise 
request from (31)@10.221.29.247:5050 with proposal 50
E0411 22:45:52.019548  8341 process.cpp:1966] Failed to shutdown socket with fd 
24: Transport endpoint is not connected
I0411 22:45:52.020465  8336 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 1.421142ms
I0411 22:45:52.020496  8336 replica.cpp:342] Persisted promised to 50
I0411 22:46:15.034744  8340 network.hpp:413] ZooKeeper group memberships changed
I0411 22:46:15.034867  8334 group.cpp:672] Trying to get 
'/mesos/log_replicas/0000000346' in ZooKeeper
I0411 22:46:15.035729  8334 group.cpp:672] Trying to get 
'/mesos/log_replicas/0000000347' in ZooKeeper
I0411 22:46:15.036533  8334 group.cpp:672] Trying to get 
'/mesos/log_replicas/0000000348' in ZooKeeper
I0411 22:46:15.037353  8335 network.hpp:461] ZooKeeper group PIDs: { 
log-replica(1)@10.221.28.61:5050, log-replica(1)@10.221.28.249:5050, 
log-replica(1)@10.221.29.247:5050 }
I0411 22:46:27.242632  8336 http.cpp:503] HTTP GET for /master/state.json from 
216.145.54.15:54890 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 
10.9; rv:45.0) Gecko/20100101 Firefox/45.0'
I0411 22:46:37.292083  8335 http.cpp:503] HTTP GET for /master/state.json from 
216.145.54.15:54890 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 
10.9; rv:45.0) Gecko/20100101 Firefox/45.0'
I0411 22:46:47.342876  8334 http.cpp:503] HTTP GET for /master/state.json from 
216.145.54.15:54890 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 
10.9; rv:45.0) Gecko/20100101 Firefox/45.0'
F0411 22:46:52.019045  8333 master.cpp:1457] Recovery failed: Failed to recover 
registrar: Failed to perform fetch within 1mins
*** Check failure stack trace: ***
    @     0x7f7ad44badfd  (unknown)
    @     0x7f7ad44bcc3d  (unknown)
    @     0x7f7ad44ba9ec  (unknown)
    @     0x7f7ad44bd539  (unknown)
    @     0x7f7ad3cee2dc  (unknown)
    @     0x7f7ad3d2eab0  (unknown)
    @           0x42807e  (unknown)
    @     0x7f7ad3d550a5  (unknown)
    @     0x7f7ad3da7976  (unknown)
    @     0x7f7ad3db8566  (unknown)
    @     0x7f7ad3be84d6  (unknown)
    @     0x7f7ad3db8553  (unknown)
    @     0x7f7ad3d9c614  (unknown)
    @     0x7f7ad4468971  (unknown)
    @     0x7f7ad4468c77  (unknown)
    @       0x35282b6470  (unknown)
    @       0x35262079d1  (unknown)
    @       0x3525ee88fd  (unknown)
    @              (nil)  (unknown)
/bin/bash: line 1:  8332 Aborted                 /usr/sbin/mesos-master 
--work_dir=/tmp/mesos_dir 
--zk=zk://scheduler1.rpt.cb.ne1.yahoo.com:2181,scheduler2.rpt.cb.ne1.yahoo.com:2181,scheduler3.rpt.cb.ne1.yahoo.com:2181/mesos
 --quorum=2
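
In all three logs the fatal line appears almost exactly one minute after the writer start is attempted, matching the "Failed to perform fetch within 1mins" message. A minimal sketch for checking that gap from a master log is below. It assumes the glog line layout seen above (severity letter, mmdd, time, thread id, file:line), that wrapped lines have been rejoined, and that the year is 2016 (taken from the date of this message, since glog lines carry no year). The names parse and seconds_until_fatal are hypothetical helpers, not part of Mesos.

```python
import re
from datetime import datetime

# glog layout as seen in the logs above: "I0411 22:45:52.018337  8333 log.cpp:659] msg"
GLOG = re.compile(
    r"^([IWEF])(\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6})\s+(\d+) ([\w.]+):(\d+)\] (.*)$"
)


def parse(line, year=2016):
    """Return (severity, timestamp, source_file, message), or None for non-glog lines."""
    m = GLOG.match(line)
    if m is None:
        return None
    severity, mmdd, clock, _tid, source, _lineno, message = m.groups()
    # The year is an assumption; glog lines only carry month and day.
    ts = datetime.strptime(f"{year}{mmdd} {clock}", "%Y%m%d %H:%M:%S.%f")
    return severity, ts, source, message


def seconds_until_fatal(lines, start_marker="Attempting to start the writer"):
    """Seconds between the first line containing start_marker and the next fatal line."""
    start = None
    for line in lines:
        record = parse(line)
        if record is None:
            continue  # stack-trace frames, shell output, blank lines
        severity, ts, _source, message = record
        if start is None and start_marker in message:
            start = ts
        elif severity == "F" and start is not None:
            return (ts - start).total_seconds()
    return None


# Two lines from the Node 1 log above, with the mail wrapping undone.
NODE1 = [
    "I0411 22:45:52.018337  8333 log.cpp:659] Attempting to start the writer",
    "F0411 22:46:52.019045  8333 master.cpp:1457] Recovery failed: Failed to "
    "recover registrar: Failed to perform fetch within 1mins",
]

if __name__ == "__main__":
    print(seconds_until_fatal(NODE1))
```

Running the same function over the Node 2 and Node 3 logs would show whether every master hits the same timeout window.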

> Recovery failed: Failed to recover registrar on reboot of mesos master
> ----------------------------------------------------------------------
>
>                 Key: MESOS-5193
>                 URL: https://issues.apache.org/jira/browse/MESOS-5193
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.22.0, 0.27.0
>            Reporter: Priyanka Gupta
>              Labels: master, mesosphere
>
> Hi all, 
> We are running a 3-node cluster with a Mesos master, a Mesos slave, and 
> ZooKeeper on each node, with Chronos on top. When we reboot the leading 
> Mesos master, the other nodes try to get elected as leader but fail to 
> recover the registrar: 
> "Recovery failed: Failed to recover registrar: Failed to perform fetch within 
> 1mins"
> The next node then tries to become the leader but fails with the same error. 
> I am not sure what causes this. We are currently on Mesos 0.22 and also 
> tried upgrading to Mesos 0.27, but the problem persists. 
> We start the master with:
>  /usr/sbin/mesos-master --work_dir=/tmp/mesos_dir 
> --zk=zk://node1:2181,node2:2181,node3:2181/mesos --quorum=2
> Can you please help us resolve this issue, as it's a production system?
> Thanks,
> Priyanka



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
