I have found the reason causing zookeeper session expiration,
the logs/ located at a very slow disk, logging operation block the
master process to send ping to zookeeper.
then we move the logs/ into /dev/shm to solve the problem.

On Sun, Dec 18, 2011 at 10:01 AM, Davies Liu <[email protected]> wrote:
> Thank you for the reply.
>
> We run master under deamon-tools, so it can restart after shutdown.
> When it restart, some msg from master or slaves may be lost,
> then some not finished application will hang, waiting for tasks to
> finish for ever.
>
> If the master do not restart so many times, the apps will run more stable.
>
> Why the session expiration occurs in ZooKeeper ? It was expected or not ?
>
> Davies
>
> On Sun, Dec 18, 2011 at 2:01 AM, Benjamin Hindman
> <[email protected]> wrote:
>> Currently, the master is programmed to shutdown if a session expiration
>> occurs. This could be considered a bug, because when only using one master
>> once ZooKeeper has sorted itself out that master will obviously get
>> re-elected (translation: I'd be happy if you filed a bug). In the mean
>> time, most HA deployments of Mesos have used multiple masters (either on
>> separate machines or not). When one master kills itself another takes over
>> and the cluster continues to operate as normal.
>>
>> Hope that helps!
>>
>>
>> On Fri, Dec 16, 2011 at 2:03 AM, Davies Liu <[email protected]> wrote:
>>
>>> Hi devs:
>>>
>>> Our Mesos master suicide after ZooKeeper session expired several times
>>> in a day with the following msg:
>>>
>>>
>>> 2011-12-16 17:06:35,594:7696(0x7f824874f700):ZOO_DEBUG@zoo_awget_children_
>>> @2612:
>>> Sending request xid=0x4eeb07ce for path [/mesos_master] to
>>> 192.168.1.32:2181
>>> 2011-12-16
>>> 17:06:35,594:7696(0x7f8246f4c700):ZOO_ERROR@handle_socket_error_msg@1621:
>>> Socket [192.168.1.32:2181] zk retcode=-112, errno=116(Stale NFS file
>>> handle): sessionId=0x333cf926ee322ef has expired.
>>> 2011-12-16 17:06:35,594:7696(0x7f8246f4c700):ZOO_DEBUG@handle_error@1138:
>>> Calling a watcher for a ZOO_SESSION_EVENT and the
>>> state=ZOO_EXPIRED_SESSION_STATE
>>> 2011-12-16 17:06:35,594:7696(0x7f8246f4c700):ZOO_DEBUG@do_io@317: IO
>>> thread terminated
>>> 2011-12-16 17:06:35,594:7696(0x7f824674b700):ZOO_DEBUG@process_completions
>>> @1765:
>>> Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
>>> 2011-12-16 17:06:35,594:7696(0x7f824674b700):ZOO_DEBUG@process_completions
>>> @1795:
>>> Calling COMPLETION_STRINGLIST for xid=0x4eeb07ce rc=-112
>>> E1216 17:06:35.594744  7699 detector.cpp:559] Master detector failed
>>> to get masters: session expired
>>> I1216 17:06:48.371737  7699 detector.cpp:457] Master detector lost
>>> connection to ZooKeeper, attempting to reconnect ...
>>> W1216 17:06:48.371817  7699 detector.cpp:500] Master detector
>>> ZooKeeper session expired!
>>> 2011-12-16 17:06:48,371:7696(0x7f824674b700):ZOO_DEBUG@do_completion@336:
>>> completion thread terminated
>>> 2011-12-16 17:06:48,371:7696(0x7f824874f700):ZOO_INFO@zookeeper_close
>>> @2307:
>>> Freeing zookeeper resources for sessionId=0x333cf926ee322ef
>>> F1216 17:06:48.374893  7699 master.cpp:475] No longer elected master
>>> ... committing suicide!
>>> *** Check failure stack trace: ***
>>>    @           0x5aa61d  google::LogMessage::Fail()
>>>    @           0x5ace2f  google::LogMessage::SendToLog()
>>>    @           0x5aa206  google::LogMessage::Flush()
>>>    @           0x5ad6ad  google::LogMessageFatal::~LogMessageFatal()
>>>    @           0x439951
>>>  mesos::internal::master::Master::noMasterDetected()
>>>    @           0x440158  mesos::internal::master::Master::operator()()
>>>    @           0x5bfc39  process::ProcessManager::run()
>>>    @           0x5bfd8a  process::trampoline()
>>>    @     0x7f8248791160  (unknown)
>>>
>>>
>>>
>>>
>>>
>>> At some times, it can reconnect to ZooKeeper without suicide:
>>>
>>>
>>> 2011-12-16 17:01:28,580:7696(0x7f824674b700):ZOO_WARN@zookeeper_interest
>>> @1461:
>>> Exceeded deadline by 13ms
>>> 2011-12-16 17:01:28,581:7696(0x7f824674b700):ZOO_DEBUG@zookeeper_process
>>> @1974:
>>> Got ping response in 0 ms
>>> 2011-12-16 17:01:31,921:7696(0x7f824674b700):ZOO_DEBUG@zookeeper_process
>>> @1974:
>>> Got ping response in 0 ms
>>> 2011-12-16 17:01:54,649:7696(0x7f824674b700):ZOO_WARN@zookeeper_interest
>>> @1461:
>>> Exceeded deadline by 19401ms
>>> 2011-12-16
>>> 17:01:54,649:7696(0x7f824674b700):ZOO_ERROR@handle_socket_error_msg@1528:
>>> Socket [192.168.1.33:2181] zk retcode=-7, errno=110(Connection timed
>>> out): connection timed out (exceeded timeout by 16062ms)
>>> 2011-12-16 17:01:54,649:7696(0x7f824674b700):ZOO_DEBUG@handle_error@1141:
>>> Calling a watcher for a ZOO_SESSION_EVENT and the
>>> state=CONNECTING_STATE
>>> 2011-12-16 17:01:54,649:7696(0x7f824674b700):ZOO_WARN@zookeeper_interest
>>> @1461:
>>> Exceeded deadline by 19401ms
>>> 2011-12-16 17:01:54,649:7696(0x7f8246f4c700):ZOO_DEBUG@process_completions
>>> @1765:
>>> Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
>>> I1216 17:01:54.657625  7699 detector.cpp:457] Master detector lost
>>> connection to ZooKeeper, attempting to reconnect ...
>>> 2011-12-16 17:01:54,660:7696(0x7f824674b700):ZOO_INFO@check_events@1585:
>>> initiated connection to server [192.168.1.34:2181]
>>> 2011-12-16
>>> 17:01:54,662:7696(0x7f824674b700):ZOO_ERROR@handle_socket_error_msg@1621:
>>> Socket [192.168.1.34:2181] zk retcode=-112, errno=116(Stale NFS file
>>> handle): sessionId=0x333cf926ee322ee has expired.
>>> 2011-12-16 17:01:54,662:7696(0x7f824674b700):ZOO_DEBUG@handle_error@1138:
>>> Calling a watcher for a ZOO_SESSION_EVENT and the
>>> state=ZOO_EXPIRED_SESSION_STATE
>>> 2011-12-16 17:01:54,662:7696(0x7f824674b700):ZOO_DEBUG@do_io@317: IO
>>> thread terminated
>>> 2011-12-16 17:01:54,662:7696(0x7f8246f4c700):ZOO_DEBUG@process_completions
>>> @1765:
>>> Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
>>> W1216 17:01:54.662583  7699 detector.cpp:500] Master detector
>>> ZooKeeper session expired!
>>> 2011-12-16 17:01:54,662:7696(0x7f8246f4c700):ZOO_DEBUG@do_completion@336:
>>> completion thread terminated
>>> 2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@zookeeper_close
>>> @2307:
>>> Freeing zookeeper resources for sessionId=0x333cf926ee322ee
>>>
>>> 2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@log_env@658:
>>> Client environment:zookeeper.version=zookeeper C client 3.3.1
>>> 2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@log_env@662:
>>> Client environment:host.name=boromir
>>> 2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@log_env@669:
>>> Client environment:os.name=Linux
>>> 2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@log_env@670:
>>> Client environment:os.arch=2.6.39-gentoo-r3
>>> 2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@log_env@671:
>>> Client environment:os.version=#4 SMP Mon Sep 19 17:12:59 CST 2011
>>> 2011-12-16 17:01:54,663:7696(0x7f824874f700):ZOO_INFO@log_env@679:
>>> Client environment:user.name=(null)
>>> 2011-12-16 17:01:54,663:7696(0x7f824874f700):ZOO_INFO@log_env@687:
>>> Client environment:user.home=/root
>>> 2011-12-16 17:01:54,663:7696(0x7f824874f700):ZOO_INFO@log_env@699:
>>> Client environment:user.dir=/opt/mesos/bin
>>> 2011-12-16 17:01:54,663:7696(0x7f824874f700):ZOO_INFO@zookeeper_init@727:
>>> Initiating client connection,
>>> host=zk1:2181,zk2:2181,zk3:2181,zk4:2181,zk5:2181 sessionTimeout=10000
>>> watcher=0x4cb750 sessionId=0 sessionPasswd=<null>
>>> context=0x7f823803c380 flags=0
>>> 2011-12-16 17:01:54,663:7696(0x7f824874f700):ZOO_DEBUG@start_threads@152:
>>> starting threads...
>>> 2011-12-16 17:01:54,680:7696(0x7f824674b700):ZOO_DEBUG@do_completion@326:
>>> started completion thread
>>> 2011-12-16 17:01:54,680:7696(0x7f8246f4c700):ZOO_DEBUG@do_io@279:
>>> started IO thread
>>> 2011-12-16 17:01:54,680:7696(0x7f8246f4c700):ZOO_INFO@check_events@1585:
>>> initiated connection to server [192.168.1.33:2181]
>>> 2011-12-16 17:01:54,719:7696(0x7f8246f4c700):ZOO_INFO@check_events@1632:
>>> session establishment complete on server [192.168.1.33:2181],
>>> sessionId=0x333cf926ee322ef, negotiated timeout=10000
>>> 2011-12-16 17:01:54,719:7696(0x7f8246f4c700):ZOO_DEBUG@check_events@1638:
>>> Calling a watcher for a ZOO_SESSION_EVENT and the
>>> state=ZOO_CONNECTED_STATE
>>> 2011-12-16 17:01:54,719:7696(0x7f824674b700):ZOO_DEBUG@process_completions
>>> @1765:
>>> Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
>>> I1216 17:01:54.719355  7699 detector.cpp:375] Master detector
>>> connected to ZooKeeper ...
>>> I1216 17:01:54.719421  7699 detector.cpp:404] Trying to create znode
>>> '/mesos_master' in ZooKeeper
>>> 2011-12-16 17:01:54,719:7696(0x7f824874f700):ZOO_DEBUG@zoo_acreate@2489:
>>> Sending request xid=0x4eeb07c9 for path [/mesos_master] to
>>> 192.168.1.33:2181
>>> 2011-12-16 17:01:54,723:7696(0x7f8246f4c700):ZOO_DEBUG@zookeeper_process
>>> @1980:
>>> Queueing asynchronous response
>>> 2011-12-16 17:01:54,723:7696(0x7f824674b700):ZOO_DEBUG@process_completions
>>> @1817:
>>> Calling COMPLETION_STRING for xid=0x4eeb07c9 rc=-110
>>> 2011-12-16 17:01:54,724:7696(0x7f824874f700):ZOO_DEBUG@zoo_awget@2400:
>>> Sending request xid=0x4eeb07ca for path [/mesos_master] to
>>> 192.168.1.33:2181
>>> 2011-12-16 17:01:54,724:7696(0x7f8246f4c700):ZOO_DEBUG@zookeeper_process
>>> @1980:
>>> Queueing asynchronous response
>>> 2011-12-16 17:01:54,724:7696(0x7f824674b700):ZOO_DEBUG@process_completions
>>> @1772:
>>> Calling COMPLETION_DATA for xid=0x4eeb07ca rc=0
>>> 2011-12-16 17:01:54,724:7696(0x7f824874f700):ZOO_DEBUG@zoo_acreate@2489:
>>> Sending request xid=0x4eeb07cb for path [/mesos_master/] to
>>> 192.168.1.33:2181
>>> 2011-12-16 17:01:54,726:7696(0x7f8246f4c700):ZOO_DEBUG@zookeeper_process
>>> @1980:
>>> Queueing asynchronous response
>>> 2011-12-16 17:01:54,726:7696(0x7f824674b700):ZOO_DEBUG@process_completions
>>> @1817:
>>> Calling COMPLETION_STRING for xid=0x4eeb07cb rc=0
>>> I1216 17:01:54.726915  7699 detector.cpp:443] Created
>>> ephemeral/sequence:0000000344
>>> 2011-12-16 17:01:54,727:7696(0x7f824874f700):ZOO_DEBUG@zoo_awget_children_
>>> @2612:
>>> Sending request xid=0x4eeb07cc for path [/mesos_master] to
>>> 192.168.1.33:2181
>>> W1216 17:01:54.727063  7699 master.cpp:299] Dropping unknown message
>>> 'mesos.internal.GotMasterTokenMessage' from: @0.0.0.0:0
>>> 2011-12-16 17:01:54,727:7696(0x7f8246f4c700):ZOO_DEBUG@zookeeper_process
>>> @1980:
>>> Queueing asynchronous response
>>> 2011-12-16 17:01:54,727:7696(0x7f824674b700):ZOO_DEBUG@process_completions
>>> @1795:
>>> Calling COMPLETION_STRINGLIST for xid=0x4eeb07cc rc=0
>>> I1216 17:01:54.727489  7699 detector.cpp:562] Master detector found 1
>>> registered masters
>>> 2011-12-16 17:01:54,727:7696(0x7f824874f700):ZOO_DEBUG@zoo_awget@2400:
>>> Sending request xid=0x4eeb07cd for path [/mesos_master/0000000344] to
>>> 192.168.1.33:2181
>>> 2011-12-16 17:01:54,728:7696(0x7f8246f4c700):ZOO_DEBUG@zookeeper_process
>>> @1980:
>>> Queueing asynchronous response
>>> 2011-12-16 17:01:54,728:7696(0x7f824674b700):ZOO_DEBUG@process_completions
>>> @1772:
>>> Calling COMPLETION_DATA for xid=0x4eeb07cd rc=0
>>> I1216 17:01:54.728397  7699 detector.cpp:592] Master detector got new
>>> master pid: [email protected]:5050
>>> I1216 17:01:54.728489  7699 master.cpp:467] Still acting as master!
>>>
>>> Our Mesos is lastest master branch at github, ZooKeeper is 3.3.3, but
>>> the client is the included one, which is 3.3.1,
>>> is it the reason ?
>>>
>>> One week before, we use Mesos 0.3-alpha, it's more stable with
>>> ZooKeeper than now.
>>>
>>> Any suggestion are welcomed, thank you.
>>>
>>> --
>>>  - Davies
>>>
>
>
>
> --
>  - Davies



-- 
 - Davies

Reply via email to