Hi devs:

Our Mesos master suicide after ZooKeeper session expired several times
in a day with the following msg:


2011-12-16 17:06:35,594:7696(0x7f824874f700):ZOO_DEBUG@zoo_awget_children_@2612:
Sending request xid=0x4eeb07ce for path [/mesos_master] to
192.168.1.32:2181
2011-12-16 
17:06:35,594:7696(0x7f8246f4c700):ZOO_ERROR@handle_socket_error_msg@1621:
Socket [192.168.1.32:2181] zk retcode=-112, errno=116(Stale NFS file
handle): sessionId=0x333cf926ee322ef has expired.
2011-12-16 17:06:35,594:7696(0x7f8246f4c700):ZOO_DEBUG@handle_error@1138:
Calling a watcher for a ZOO_SESSION_EVENT and the
state=ZOO_EXPIRED_SESSION_STATE
2011-12-16 17:06:35,594:7696(0x7f8246f4c700):ZOO_DEBUG@do_io@317: IO
thread terminated
2011-12-16 17:06:35,594:7696(0x7f824674b700):ZOO_DEBUG@process_completions@1765:
Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
2011-12-16 17:06:35,594:7696(0x7f824674b700):ZOO_DEBUG@process_completions@1795:
Calling COMPLETION_STRINGLIST for xid=0x4eeb07ce rc=-112
E1216 17:06:35.594744  7699 detector.cpp:559] Master detector failed
to get masters: session expired
I1216 17:06:48.371737  7699 detector.cpp:457] Master detector lost
connection to ZooKeeper, attempting to reconnect ...
W1216 17:06:48.371817  7699 detector.cpp:500] Master detector
ZooKeeper session expired!
2011-12-16 17:06:48,371:7696(0x7f824674b700):ZOO_DEBUG@do_completion@336:
completion thread terminated
2011-12-16 17:06:48,371:7696(0x7f824874f700):ZOO_INFO@zookeeper_close@2307:
Freeing zookeeper resources for sessionId=0x333cf926ee322ef
F1216 17:06:48.374893  7699 master.cpp:475] No longer elected master
... committing suicide!
*** Check failure stack trace: ***
    @           0x5aa61d  google::LogMessage::Fail()
    @           0x5ace2f  google::LogMessage::SendToLog()
    @           0x5aa206  google::LogMessage::Flush()
    @           0x5ad6ad  google::LogMessageFatal::~LogMessageFatal()
    @           0x439951  mesos::internal::master::Master::noMasterDetected()
    @           0x440158  mesos::internal::master::Master::operator()()
    @           0x5bfc39  process::ProcessManager::run()
    @           0x5bfd8a  process::trampoline()
    @     0x7f8248791160  (unknown)





At some times, it can reconnect to ZooKeeper without suicide:


2011-12-16 17:01:28,580:7696(0x7f824674b700):ZOO_WARN@zookeeper_interest@1461:
Exceeded deadline by 13ms
2011-12-16 17:01:28,581:7696(0x7f824674b700):ZOO_DEBUG@zookeeper_process@1974:
Got ping response in 0 ms
2011-12-16 17:01:31,921:7696(0x7f824674b700):ZOO_DEBUG@zookeeper_process@1974:
Got ping response in 0 ms
2011-12-16 17:01:54,649:7696(0x7f824674b700):ZOO_WARN@zookeeper_interest@1461:
Exceeded deadline by 19401ms
2011-12-16 
17:01:54,649:7696(0x7f824674b700):ZOO_ERROR@handle_socket_error_msg@1528:
Socket [192.168.1.33:2181] zk retcode=-7, errno=110(Connection timed
out): connection timed out (exceeded timeout by 16062ms)
2011-12-16 17:01:54,649:7696(0x7f824674b700):ZOO_DEBUG@handle_error@1141:
Calling a watcher for a ZOO_SESSION_EVENT and the
state=CONNECTING_STATE
2011-12-16 17:01:54,649:7696(0x7f824674b700):ZOO_WARN@zookeeper_interest@1461:
Exceeded deadline by 19401ms
2011-12-16 17:01:54,649:7696(0x7f8246f4c700):ZOO_DEBUG@process_completions@1765:
Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
I1216 17:01:54.657625  7699 detector.cpp:457] Master detector lost
connection to ZooKeeper, attempting to reconnect ...
2011-12-16 17:01:54,660:7696(0x7f824674b700):ZOO_INFO@check_events@1585:
initiated connection to server [192.168.1.34:2181]
2011-12-16 
17:01:54,662:7696(0x7f824674b700):ZOO_ERROR@handle_socket_error_msg@1621:
Socket [192.168.1.34:2181] zk retcode=-112, errno=116(Stale NFS file
handle): sessionId=0x333cf926ee322ee has expired.
2011-12-16 17:01:54,662:7696(0x7f824674b700):ZOO_DEBUG@handle_error@1138:
Calling a watcher for a ZOO_SESSION_EVENT and the
state=ZOO_EXPIRED_SESSION_STATE
2011-12-16 17:01:54,662:7696(0x7f824674b700):ZOO_DEBUG@do_io@317: IO
thread terminated
2011-12-16 17:01:54,662:7696(0x7f8246f4c700):ZOO_DEBUG@process_completions@1765:
Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
W1216 17:01:54.662583  7699 detector.cpp:500] Master detector
ZooKeeper session expired!
2011-12-16 17:01:54,662:7696(0x7f8246f4c700):ZOO_DEBUG@do_completion@336:
completion thread terminated
2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@zookeeper_close@2307:
Freeing zookeeper resources for sessionId=0x333cf926ee322ee

2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@log_env@658:
Client environment:zookeeper.version=zookeeper C client 3.3.1
2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@log_env@662:
Client environment:host.name=boromir
2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@log_env@669:
Client environment:os.name=Linux
2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@log_env@670:
Client environment:os.arch=2.6.39-gentoo-r3
2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@log_env@671:
Client environment:os.version=#4 SMP Mon Sep 19 17:12:59 CST 2011
2011-12-16 17:01:54,663:7696(0x7f824874f700):ZOO_INFO@log_env@679:
Client environment:user.name=(null)
2011-12-16 17:01:54,663:7696(0x7f824874f700):ZOO_INFO@log_env@687:
Client environment:user.home=/root
2011-12-16 17:01:54,663:7696(0x7f824874f700):ZOO_INFO@log_env@699:
Client environment:user.dir=/opt/mesos/bin
2011-12-16 17:01:54,663:7696(0x7f824874f700):ZOO_INFO@zookeeper_init@727:
Initiating client connection,
host=zk1:2181,zk2:2181,zk3:2181,zk4:2181,zk5:2181 sessionTimeout=10000
watcher=0x4cb750 sessionId=0 sessionPasswd=<null>
context=0x7f823803c380 flags=0
2011-12-16 17:01:54,663:7696(0x7f824874f700):ZOO_DEBUG@start_threads@152:
starting threads...
2011-12-16 17:01:54,680:7696(0x7f824674b700):ZOO_DEBUG@do_completion@326:
started completion thread
2011-12-16 17:01:54,680:7696(0x7f8246f4c700):ZOO_DEBUG@do_io@279:
started IO thread
2011-12-16 17:01:54,680:7696(0x7f8246f4c700):ZOO_INFO@check_events@1585:
initiated connection to server [192.168.1.33:2181]
2011-12-16 17:01:54,719:7696(0x7f8246f4c700):ZOO_INFO@check_events@1632:
session establishment complete on server [192.168.1.33:2181],
sessionId=0x333cf926ee322ef, negotiated timeout=10000
2011-12-16 17:01:54,719:7696(0x7f8246f4c700):ZOO_DEBUG@check_events@1638:
Calling a watcher for a ZOO_SESSION_EVENT and the
state=ZOO_CONNECTED_STATE
2011-12-16 17:01:54,719:7696(0x7f824674b700):ZOO_DEBUG@process_completions@1765:
Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
I1216 17:01:54.719355  7699 detector.cpp:375] Master detector
connected to ZooKeeper ...
I1216 17:01:54.719421  7699 detector.cpp:404] Trying to create znode
'/mesos_master' in ZooKeeper
2011-12-16 17:01:54,719:7696(0x7f824874f700):ZOO_DEBUG@zoo_acreate@2489:
Sending request xid=0x4eeb07c9 for path [/mesos_master] to
192.168.1.33:2181
2011-12-16 17:01:54,723:7696(0x7f8246f4c700):ZOO_DEBUG@zookeeper_process@1980:
Queueing asynchronous response
2011-12-16 17:01:54,723:7696(0x7f824674b700):ZOO_DEBUG@process_completions@1817:
Calling COMPLETION_STRING for xid=0x4eeb07c9 rc=-110
2011-12-16 17:01:54,724:7696(0x7f824874f700):ZOO_DEBUG@zoo_awget@2400:
Sending request xid=0x4eeb07ca for path [/mesos_master] to
192.168.1.33:2181
2011-12-16 17:01:54,724:7696(0x7f8246f4c700):ZOO_DEBUG@zookeeper_process@1980:
Queueing asynchronous response
2011-12-16 17:01:54,724:7696(0x7f824674b700):ZOO_DEBUG@process_completions@1772:
Calling COMPLETION_DATA for xid=0x4eeb07ca rc=0
2011-12-16 17:01:54,724:7696(0x7f824874f700):ZOO_DEBUG@zoo_acreate@2489:
Sending request xid=0x4eeb07cb for path [/mesos_master/] to
192.168.1.33:2181
2011-12-16 17:01:54,726:7696(0x7f8246f4c700):ZOO_DEBUG@zookeeper_process@1980:
Queueing asynchronous response
2011-12-16 17:01:54,726:7696(0x7f824674b700):ZOO_DEBUG@process_completions@1817:
Calling COMPLETION_STRING for xid=0x4eeb07cb rc=0
I1216 17:01:54.726915  7699 detector.cpp:443] Created
ephemeral/sequence:0000000344
2011-12-16 17:01:54,727:7696(0x7f824874f700):ZOO_DEBUG@zoo_awget_children_@2612:
Sending request xid=0x4eeb07cc for path [/mesos_master] to
192.168.1.33:2181
W1216 17:01:54.727063  7699 master.cpp:299] Dropping unknown message
'mesos.internal.GotMasterTokenMessage' from: @0.0.0.0:0
2011-12-16 17:01:54,727:7696(0x7f8246f4c700):ZOO_DEBUG@zookeeper_process@1980:
Queueing asynchronous response
2011-12-16 17:01:54,727:7696(0x7f824674b700):ZOO_DEBUG@process_completions@1795:
Calling COMPLETION_STRINGLIST for xid=0x4eeb07cc rc=0
I1216 17:01:54.727489  7699 detector.cpp:562] Master detector found 1
registered masters
2011-12-16 17:01:54,727:7696(0x7f824874f700):ZOO_DEBUG@zoo_awget@2400:
Sending request xid=0x4eeb07cd for path [/mesos_master/0000000344] to
192.168.1.33:2181
2011-12-16 17:01:54,728:7696(0x7f8246f4c700):ZOO_DEBUG@zookeeper_process@1980:
Queueing asynchronous response
2011-12-16 17:01:54,728:7696(0x7f824674b700):ZOO_DEBUG@process_completions@1772:
Calling COMPLETION_DATA for xid=0x4eeb07cd rc=0
I1216 17:01:54.728397  7699 detector.cpp:592] Master detector got new
master pid: [email protected]:5050
I1216 17:01:54.728489  7699 master.cpp:467] Still acting as master!

Our Mesos is lastest master branch at github, ZooKeeper is 3.3.3, but
the client is the included one, which is 3.3.1,
is it the reason ?

One week before, we use Mesos 0.3-alpha, it's more stable with
ZooKeeper than now.

Any suggestion are welcomed, thank you.

-- 
 - Davies

Reply via email to