[
https://issues.apache.org/jira/browse/MESOS-8703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16408442#comment-16408442
]
Benno Evers commented on MESOS-8703:
------------------------------------
The original ZooKeeper crash might well be caused by MESOS-8550.
However, that should normally just result in a crash and subsequent restart of
the master. Instead, the master seems to lock up during shutdown. The cause
might be an issue similar to MESOS-1477, although I couldn't see any
suspicious changes to the related files for version 1.4.1.
If this issue is somewhat reproducible, it would probably be helpful to include
stack traces for all threads when the master becomes unresponsive.
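One possible way to collect those traces is to attach gdb to the hung master
in batch mode. The sketch below is only an illustration, not an official Mesos
procedure: the helper names are hypothetical, and it assumes gdb is installed,
that you have permission to attach to the process, and that debug symbols are
available (otherwise the frames show up as "(unknown)", as in the quoted log).
The log reports the master as PID 1, which suggests it runs in a container, so
gdb would probably have to run inside that container or against the
corresponding host PID.

```python
import subprocess
import sys


def build_gdb_command(pid):
    """Return the argv list for a non-interactive backtrace of all threads.

    -p attaches to the running process, -batch exits after the commands
    have run, and -ex executes the given gdb command.
    """
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def dump_all_threads(pid, timeout=60):
    """Attach gdb to `pid` and return its combined output as text.

    Hypothetical helper: the process is paused while gdb is attached and
    resumes once gdb detaches at the end of the batch run.
    """
    result = subprocess.run(
        build_gdb_command(pid),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        timeout=timeout,
    )
    return result.stdout


if __name__ == "__main__":
    # Usage: python dump_threads.py <pid of mesos-master>
    print(dump_all_threads(int(sys.argv[1])))
```

Redirecting the output to a file and attaching it to the ticket should make it
possible to see which threads are blocked during shutdown.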
> Mesos master can't reconnect to ZooKeeper
> ------------------------------------------
>
> Key: MESOS-8703
> URL: https://issues.apache.org/jira/browse/MESOS-8703
> Project: Mesos
> Issue Type: Bug
> Components: master
> Affects Versions: 1.4.1
> Reporter: Anton Malevich
> Priority: Blocker
>
> Mesos master can't reconnect to ZooKeeper after ZooKeeper hangs.
> {noformat}
> 2018-03-20
> 10:16:45,608:1(0x2ae675db6700):ZOO_ERROR@handle_socket_error_msg@1666: Socket
> [<zknode1>:2181] zk retcode=-7, errno=110(Connection timed out): connection
> to <zknode1>:2181 timed out (exceeded timeout by 3ms)
> 2018-03-20 10:16:45,609:1(0x2ae675db6700):ZOO_INFO@check_events@1728:
> initiated connection to server [<zknode2>:2181]
> 2018-03-20
> 10:16:45,619:1(0x2ae675db6700):ZOO_ERROR@handle_socket_error_msg@1764: Socket
> [<zknode2>:2181] zk retcode=-112, errno=116(Stale file handle):
> sessionId=0x5623d0e483dd435 has expired.
> I0320 10:16:45.620604 18 group.cpp:511] ZooKeeper session expired
> I0320 10:16:45.620802 16 detector.cpp:152] Detected a new leader: None
> I0320 10:16:45.620957 16 master.cpp:2176] The newly elected leader is None
> mesos-master: ../../3rdparty/stout/include/stout/option.hpp:112: T&
> Option<T>::get() & [with T = mesos::MasterInfo]: Assertion `isSome()' failed.
> *** Aborted at 1521541005 (unix time) try "date -d @1521541005" if you are
> using GNU date ***
> PC: @ 0x2ae63d2b9428 (unknown)
> *** SIGABRT (@0x1) received by PID 1 (TID 0x2ae648ffa700) from PID 1; stack
> trace: ***
> @ 0x2ae63d078390 (unknown)
> @ 0x2ae63d2b9428 (unknown)
> @ 0x2ae63d2bb02a (unknown)
> @ 0x2ae63d2b1bd7 (unknown)
> @ 0x2ae63d2b1c82 (unknown)
> 2018-03-20 10:16:45,622:1(0x2ae649ffc700):ZOO_INFO@zookeeper_close@2543:
> Freeing zookeeper resources for sessionId=0x5623d0e483dd435
> 2018-03-20 10:16:45,623:1(0x2ae6477f7700):ZOO_INFO@log_env@726: Client
> environment:zookeeper.version=zookeeper C client 3.4.8
> 2018-03-20 10:16:45,623:1(0x2ae6477f7700):ZOO_INFO@log_env@730: Client
> environment:host.name=<mesos_hostname>
> 2018-03-20 10:16:45,623:1(0x2ae6477f7700):ZOO_INFO@log_env@737: Client
> environment:os.name=Linux
> 2018-03-20 10:16:45,623:1(0x2ae6477f7700):ZOO_INFO@log_env@738: Client
> environment:os.arch=4.8.15-1.el7.wg.x86_64
> 2018-03-20 10:16:45,623:1(0x2ae6477f7700):ZOO_INFO@log_env@739: Client
> environment:os.version=#1 SMP Mon Dec 26 14:34:45 UTC 2016
> 2018-03-20 10:16:45,624:1(0x2ae6477f7700):ZOO_INFO@log_env@747: Client
> environment:user.name=(null)
> 2018-03-20 10:16:45,624:1(0x2ae6477f7700):ZOO_INFO@log_env@755: Client
> environment:user.home=/root
> 2018-03-20 10:16:45,624:1(0x2ae6477f7700):ZOO_INFO@log_env@767: Client
> environment:user.dir=/
> 2018-03-20 10:16:45,624:1(0x2ae6477f7700):ZOO_INFO@zookeeper_init@800:
> Initiating client connection, host=<zk_pool> sessionTimeout=10000
> watcher=0x2ae63b3711e0 sessionId=0 sessionPasswd=<null>
> context=0x2ae6900036f8 flags=0
> @ 0x2ae63ad6b55b mesos::internal::master::Master::detected()
> @ 0x2ae63b9e4cfc process::ProcessBase::visit()
> 2018-03-20 10:16:45,634:1(0x2ae6765b7700):ZOO_INFO@check_events@1728:
> initiated connection to server [<zknode1>:2181]
> @ 0x2ae63b9fac84 process::ProcessManager::resume()
> @ 0x2ae63b9fd5e6
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> @ 0x2ae63c87ec80 (unknown)
> @ 0x2ae63d06e6ba start_thread
> @ 0x2ae63d38b3dd (unknown)
> 2018-03-20 10:16:45,651:1(0x2ae6765b7700):ZOO_INFO@check_events@1775: session
> establishment complete on server [<zknode1>:2181],
> sessionId=0x1623f43348692c7, negotiated timeout=10000
> I0320 10:16:45.651684 15 group.cpp:341] Group process
> (zookeeper-group(2)@<mesos4>:5050) connected to ZooKeeper
> I0320 10:16:45.651733 15 group.cpp:831] Syncing group operations: queue
> size (joins, cancels, datas) = (0, 0, 0)
> I0320 10:16:45.651743 15 group.cpp:419] Trying to create path '/mesos' in
> ZooKeeper
> I0320 10:16:45.676736 15 detector.cpp:152] Detected a new leader:
> (id='704')
> I0320 10:16:45.676844 15 group.cpp:700] Trying to get
> '/mesos/json.info_0000000704' in ZooKeeper
> I0320 10:16:45.683346 15 zookeeper.cpp:262] A new leading master
> (UPID=master@<mesos4>:5050) is detected
> {noformat}
> After this, the Mesos master does not respond to HTTP requests, and leader
> election does not happen.