I have found the reason causing zookeeper session expiration, the logs/ located at a very slow disk, logging operation block the master process to send ping to zookeeper. then we move the logs/ into /dev/shm to solve the problem.
On Sun, Dec 18, 2011 at 10:01 AM, Davies Liu <[email protected]> wrote: > Thank you for the reply. > > We run master under deamon-tools, so it can restart after shutdown. > When it restart, some msg from master or slaves may be lost, > then some not finished application will hang, waiting for tasks to > finish for ever. > > If the master do not restart so many times, the apps will run more stable. > > Why the session expiration occurs in ZooKeeper ? It was expected or not ? > > Davies > > On Sun, Dec 18, 2011 at 2:01 AM, Benjamin Hindman > <[email protected]> wrote: >> Currently, the master is programmed to shutdown if a session expiration >> occurs. This could be considered a bug, because when only using one master >> once ZooKeeper has sorted itself out that master will obviously get >> re-elected (translation: I'd be happy if you filed a bug). In the mean >> time, most HA deployments of Mesos have used multiple masters (either on >> separate machines or not). When one master kills itself another takes over >> and the cluster continues to operate as normal. >> >> Hope that helps! >> >> >> On Fri, Dec 16, 2011 at 2:03 AM, Davies Liu <[email protected]> wrote: >> >>> Hi devs: >>> >>> Our Mesos master suicide after ZooKeeper session expired several times >>> in a day with the following msg: >>> >>> >>> 2011-12-16 17:06:35,594:7696(0x7f824874f700):ZOO_DEBUG@zoo_awget_children_ >>> @2612: >>> Sending request xid=0x4eeb07ce for path [/mesos_master] to >>> 192.168.1.32:2181 >>> 2011-12-16 >>> 17:06:35,594:7696(0x7f8246f4c700):ZOO_ERROR@handle_socket_error_msg@1621: >>> Socket [192.168.1.32:2181] zk retcode=-112, errno=116(Stale NFS file >>> handle): sessionId=0x333cf926ee322ef has expired. >>> 2011-12-16 17:06:35,594:7696(0x7f8246f4c700):ZOO_DEBUG@handle_error@1138: >>> Calling a watcher for a ZOO_SESSION_EVENT and the >>> state=ZOO_EXPIRED_SESSION_STATE >>> 2011-12-16 17:06:35,594:7696(0x7f8246f4c700):ZOO_DEBUG@do_io@317: IO >>> thread terminated >>> 2011-12-16 17:06:35,594:7696(0x7f824674b700):ZOO_DEBUG@process_completions >>> @1765: >>> Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT >>> 2011-12-16 17:06:35,594:7696(0x7f824674b700):ZOO_DEBUG@process_completions >>> @1795: >>> Calling COMPLETION_STRINGLIST for xid=0x4eeb07ce rc=-112 >>> E1216 17:06:35.594744 7699 detector.cpp:559] Master detector failed >>> to get masters: session expired >>> I1216 17:06:48.371737 7699 detector.cpp:457] Master detector lost >>> connection to ZooKeeper, attempting to reconnect ... >>> W1216 17:06:48.371817 7699 detector.cpp:500] Master detector >>> ZooKeeper session expired! >>> 2011-12-16 17:06:48,371:7696(0x7f824674b700):ZOO_DEBUG@do_completion@336: >>> completion thread terminated >>> 2011-12-16 17:06:48,371:7696(0x7f824874f700):ZOO_INFO@zookeeper_close >>> @2307: >>> Freeing zookeeper resources for sessionId=0x333cf926ee322ef >>> F1216 17:06:48.374893 7699 master.cpp:475] No longer elected master >>> ... committing suicide! >>> *** Check failure stack trace: *** >>> @ 0x5aa61d google::LogMessage::Fail() >>> @ 0x5ace2f google::LogMessage::SendToLog() >>> @ 0x5aa206 google::LogMessage::Flush() >>> @ 0x5ad6ad google::LogMessageFatal::~LogMessageFatal() >>> @ 0x439951 >>> mesos::internal::master::Master::noMasterDetected() >>> @ 0x440158 mesos::internal::master::Master::operator()() >>> @ 0x5bfc39 process::ProcessManager::run() >>> @ 0x5bfd8a process::trampoline() >>> @ 0x7f8248791160 (unknown) >>> >>> >>> >>> >>> >>> At some times, it can reconnect to ZooKeeper without suicide: >>> >>> >>> 2011-12-16 17:01:28,580:7696(0x7f824674b700):ZOO_WARN@zookeeper_interest >>> @1461: >>> Exceeded deadline by 13ms >>> 2011-12-16 17:01:28,581:7696(0x7f824674b700):ZOO_DEBUG@zookeeper_process >>> @1974: >>> Got ping response in 0 ms >>> 2011-12-16 17:01:31,921:7696(0x7f824674b700):ZOO_DEBUG@zookeeper_process >>> @1974: >>> Got ping response in 0 ms >>> 2011-12-16 17:01:54,649:7696(0x7f824674b700):ZOO_WARN@zookeeper_interest >>> @1461: >>> Exceeded deadline by 19401ms >>> 2011-12-16 >>> 17:01:54,649:7696(0x7f824674b700):ZOO_ERROR@handle_socket_error_msg@1528: >>> Socket [192.168.1.33:2181] zk retcode=-7, errno=110(Connection timed >>> out): connection timed out (exceeded timeout by 16062ms) >>> 2011-12-16 17:01:54,649:7696(0x7f824674b700):ZOO_DEBUG@handle_error@1141: >>> Calling a watcher for a ZOO_SESSION_EVENT and the >>> state=CONNECTING_STATE >>> 2011-12-16 17:01:54,649:7696(0x7f824674b700):ZOO_WARN@zookeeper_interest >>> @1461: >>> Exceeded deadline by 19401ms >>> 2011-12-16 17:01:54,649:7696(0x7f8246f4c700):ZOO_DEBUG@process_completions >>> @1765: >>> Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT >>> I1216 17:01:54.657625 7699 detector.cpp:457] Master detector lost >>> connection to ZooKeeper, attempting to reconnect ... >>> 2011-12-16 17:01:54,660:7696(0x7f824674b700):ZOO_INFO@check_events@1585: >>> initiated connection to server [192.168.1.34:2181] >>> 2011-12-16 >>> 17:01:54,662:7696(0x7f824674b700):ZOO_ERROR@handle_socket_error_msg@1621: >>> Socket [192.168.1.34:2181] zk retcode=-112, errno=116(Stale NFS file >>> handle): sessionId=0x333cf926ee322ee has expired. >>> 2011-12-16 17:01:54,662:7696(0x7f824674b700):ZOO_DEBUG@handle_error@1138: >>> Calling a watcher for a ZOO_SESSION_EVENT and the >>> state=ZOO_EXPIRED_SESSION_STATE >>> 2011-12-16 17:01:54,662:7696(0x7f824674b700):ZOO_DEBUG@do_io@317: IO >>> thread terminated >>> 2011-12-16 17:01:54,662:7696(0x7f8246f4c700):ZOO_DEBUG@process_completions >>> @1765: >>> Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT >>> W1216 17:01:54.662583 7699 detector.cpp:500] Master detector >>> ZooKeeper session expired! >>> 2011-12-16 17:01:54,662:7696(0x7f8246f4c700):ZOO_DEBUG@do_completion@336: >>> completion thread terminated >>> 2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@zookeeper_close >>> @2307: >>> Freeing zookeeper resources for sessionId=0x333cf926ee322ee >>> >>> 2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@log_env@658: >>> Client environment:zookeeper.version=zookeeper C client 3.3.1 >>> 2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@log_env@662: >>> Client environment:host.name=boromir >>> 2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@log_env@669: >>> Client environment:os.name=Linux >>> 2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@log_env@670: >>> Client environment:os.arch=2.6.39-gentoo-r3 >>> 2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@log_env@671: >>> Client environment:os.version=#4 SMP Mon Sep 19 17:12:59 CST 2011 >>> 2011-12-16 17:01:54,663:7696(0x7f824874f700):ZOO_INFO@log_env@679: >>> Client environment:user.name=(null) >>> 2011-12-16 17:01:54,663:7696(0x7f824874f700):ZOO_INFO@log_env@687: >>> Client environment:user.home=/root >>> 2011-12-16 17:01:54,663:7696(0x7f824874f700):ZOO_INFO@log_env@699: >>> Client environment:user.dir=/opt/mesos/bin >>> 2011-12-16 17:01:54,663:7696(0x7f824874f700):ZOO_INFO@zookeeper_init@727: >>> Initiating client connection, >>> host=zk1:2181,zk2:2181,zk3:2181,zk4:2181,zk5:2181 sessionTimeout=10000 >>> watcher=0x4cb750 sessionId=0 sessionPasswd=<null> >>> context=0x7f823803c380 flags=0 >>> 2011-12-16 17:01:54,663:7696(0x7f824874f700):ZOO_DEBUG@start_threads@152: >>> starting threads... >>> 2011-12-16 17:01:54,680:7696(0x7f824674b700):ZOO_DEBUG@do_completion@326: >>> started completion thread >>> 2011-12-16 17:01:54,680:7696(0x7f8246f4c700):ZOO_DEBUG@do_io@279: >>> started IO thread >>> 2011-12-16 17:01:54,680:7696(0x7f8246f4c700):ZOO_INFO@check_events@1585: >>> initiated connection to server [192.168.1.33:2181] >>> 2011-12-16 17:01:54,719:7696(0x7f8246f4c700):ZOO_INFO@check_events@1632: >>> session establishment complete on server [192.168.1.33:2181], >>> sessionId=0x333cf926ee322ef, negotiated timeout=10000 >>> 2011-12-16 17:01:54,719:7696(0x7f8246f4c700):ZOO_DEBUG@check_events@1638: >>> Calling a watcher for a ZOO_SESSION_EVENT and the >>> state=ZOO_CONNECTED_STATE >>> 2011-12-16 17:01:54,719:7696(0x7f824674b700):ZOO_DEBUG@process_completions >>> @1765: >>> Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT >>> I1216 17:01:54.719355 7699 detector.cpp:375] Master detector >>> connected to ZooKeeper ... >>> I1216 17:01:54.719421 7699 detector.cpp:404] Trying to create znode >>> '/mesos_master' in ZooKeeper >>> 2011-12-16 17:01:54,719:7696(0x7f824874f700):ZOO_DEBUG@zoo_acreate@2489: >>> Sending request xid=0x4eeb07c9 for path [/mesos_master] to >>> 192.168.1.33:2181 >>> 2011-12-16 17:01:54,723:7696(0x7f8246f4c700):ZOO_DEBUG@zookeeper_process >>> @1980: >>> Queueing asynchronous response >>> 2011-12-16 17:01:54,723:7696(0x7f824674b700):ZOO_DEBUG@process_completions >>> @1817: >>> Calling COMPLETION_STRING for xid=0x4eeb07c9 rc=-110 >>> 2011-12-16 17:01:54,724:7696(0x7f824874f700):ZOO_DEBUG@zoo_awget@2400: >>> Sending request xid=0x4eeb07ca for path [/mesos_master] to >>> 192.168.1.33:2181 >>> 2011-12-16 17:01:54,724:7696(0x7f8246f4c700):ZOO_DEBUG@zookeeper_process >>> @1980: >>> Queueing asynchronous response >>> 2011-12-16 17:01:54,724:7696(0x7f824674b700):ZOO_DEBUG@process_completions >>> @1772: >>> Calling COMPLETION_DATA for xid=0x4eeb07ca rc=0 >>> 2011-12-16 17:01:54,724:7696(0x7f824874f700):ZOO_DEBUG@zoo_acreate@2489: >>> Sending request xid=0x4eeb07cb for path [/mesos_master/] to >>> 192.168.1.33:2181 >>> 2011-12-16 17:01:54,726:7696(0x7f8246f4c700):ZOO_DEBUG@zookeeper_process >>> @1980: >>> Queueing asynchronous response >>> 2011-12-16 17:01:54,726:7696(0x7f824674b700):ZOO_DEBUG@process_completions >>> @1817: >>> Calling COMPLETION_STRING for xid=0x4eeb07cb rc=0 >>> I1216 17:01:54.726915 7699 detector.cpp:443] Created >>> ephemeral/sequence:0000000344 >>> 2011-12-16 17:01:54,727:7696(0x7f824874f700):ZOO_DEBUG@zoo_awget_children_ >>> @2612: >>> Sending request xid=0x4eeb07cc for path [/mesos_master] to >>> 192.168.1.33:2181 >>> W1216 17:01:54.727063 7699 master.cpp:299] Dropping unknown message >>> 'mesos.internal.GotMasterTokenMessage' from: @0.0.0.0:0 >>> 2011-12-16 17:01:54,727:7696(0x7f8246f4c700):ZOO_DEBUG@zookeeper_process >>> @1980: >>> Queueing asynchronous response >>> 2011-12-16 17:01:54,727:7696(0x7f824674b700):ZOO_DEBUG@process_completions >>> @1795: >>> Calling COMPLETION_STRINGLIST for xid=0x4eeb07cc rc=0 >>> I1216 17:01:54.727489 7699 detector.cpp:562] Master detector found 1 >>> registered masters >>> 2011-12-16 17:01:54,727:7696(0x7f824874f700):ZOO_DEBUG@zoo_awget@2400: >>> Sending request xid=0x4eeb07cd for path [/mesos_master/0000000344] to >>> 192.168.1.33:2181 >>> 2011-12-16 17:01:54,728:7696(0x7f8246f4c700):ZOO_DEBUG@zookeeper_process >>> @1980: >>> Queueing asynchronous response >>> 2011-12-16 17:01:54,728:7696(0x7f824674b700):ZOO_DEBUG@process_completions >>> @1772: >>> Calling COMPLETION_DATA for xid=0x4eeb07cd rc=0 >>> I1216 17:01:54.728397 7699 detector.cpp:592] Master detector got new >>> master pid: [email protected]:5050 >>> I1216 17:01:54.728489 7699 master.cpp:467] Still acting as master! >>> >>> Our Mesos is lastest master branch at github, ZooKeeper is 3.3.3, but >>> the client is the included one, which is 3.3.1, >>> is it the reason ? >>> >>> One week before, we use Mesos 0.3-alpha, it's more stable with >>> ZooKeeper than now. >>> >>> Any suggestion are welcomed, thank you. >>> >>> -- >>> - Davies >>> > > > > -- > - Davies -- - Davies
