[ https://issues.apache.org/jira/browse/MESOS-8703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412190#comment-16412190 ]

Benjamin Mahler commented on MESOS-8703:
----------------------------------------

[~bennoe] should this be closed and marked as fixed in 1.4.2 per the fix you 
posted? We can track the deadlock investigation separately if [~Lomonosow] is 
able to provide stack traces.

 

> Mesos master can`t reconnect to zookeeper 
> ------------------------------------------
>
>                 Key: MESOS-8703
>                 URL: https://issues.apache.org/jira/browse/MESOS-8703
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.4.1
>            Reporter: Anton Malevich
>            Priority: Blocker
>
> Mesos master can't reconnect to ZooKeeper after ZooKeeper hangs.
> {noformat}
> 2018-03-20 
> 10:16:45,608:1(0x2ae675db6700):ZOO_ERROR@handle_socket_error_msg@1666: Socket 
> [<zknode1>:2181] zk retcode=-7, errno=110(Connection timed out): connection 
> to <zknode1>:2181 timed out (exceeded timeout by 3ms)
> 2018-03-20 10:16:45,609:1(0x2ae675db6700):ZOO_INFO@check_events@1728: 
> initiated connection to server [<zknode2>:2181]
> 2018-03-20 
> 10:16:45,619:1(0x2ae675db6700):ZOO_ERROR@handle_socket_error_msg@1764: Socket 
> [<zknode2>:2181] zk retcode=-112, errno=116(Stale file handle): 
> sessionId=0x5623d0e483dd435 has expired.
> I0320 10:16:45.620604    18 group.cpp:511] ZooKeeper session expired
> I0320 10:16:45.620802    16 detector.cpp:152] Detected a new leader: None
> I0320 10:16:45.620957    16 master.cpp:2176] The newly elected leader is None
> mesos-master: ../../3rdparty/stout/include/stout/option.hpp:112: T& 
> Option<T>::get() & [with T = mesos::MasterInfo]: Assertion `isSome()' failed.
> *** Aborted at 1521541005 (unix time) try "date -d @1521541005" if you are 
> using GNU date ***
> PC: @     0x2ae63d2b9428 (unknown)
> *** SIGABRT (@0x1) received by PID 1 (TID 0x2ae648ffa700) from PID 1; stack 
> trace: ***
>     @     0x2ae63d078390 (unknown)
>     @     0x2ae63d2b9428 (unknown)
>     @     0x2ae63d2bb02a (unknown)
>     @     0x2ae63d2b1bd7 (unknown)
>     @     0x2ae63d2b1c82 (unknown)
> 2018-03-20 10:16:45,622:1(0x2ae649ffc700):ZOO_INFO@zookeeper_close@2543: 
> Freeing zookeeper resources for sessionId=0x5623d0e483dd435
> 2018-03-20 10:16:45,623:1(0x2ae6477f7700):ZOO_INFO@log_env@726: Client 
> environment:zookeeper.version=zookeeper C client 3.4.8
> 2018-03-20 10:16:45,623:1(0x2ae6477f7700):ZOO_INFO@log_env@730: Client 
> environment:host.name=<mesos_hostname>
> 2018-03-20 10:16:45,623:1(0x2ae6477f7700):ZOO_INFO@log_env@737: Client 
> environment:os.name=Linux
> 2018-03-20 10:16:45,623:1(0x2ae6477f7700):ZOO_INFO@log_env@738: Client 
> environment:os.arch=4.8.15-1.el7.wg.x86_64
> 2018-03-20 10:16:45,623:1(0x2ae6477f7700):ZOO_INFO@log_env@739: Client 
> environment:os.version=#1 SMP Mon Dec 26 14:34:45 UTC 2016
> 2018-03-20 10:16:45,624:1(0x2ae6477f7700):ZOO_INFO@log_env@747: Client 
> environment:user.name=(null)
> 2018-03-20 10:16:45,624:1(0x2ae6477f7700):ZOO_INFO@log_env@755: Client 
> environment:user.home=/root
> 2018-03-20 10:16:45,624:1(0x2ae6477f7700):ZOO_INFO@log_env@767: Client 
> environment:user.dir=/
> 2018-03-20 10:16:45,624:1(0x2ae6477f7700):ZOO_INFO@zookeeper_init@800: 
> Initiating client connection, host=<zk_pool> sessionTimeout=10000 
> watcher=0x2ae63b3711e0 sessionId=0 sessionPasswd=<null> 
> context=0x2ae6900036f8 flags=0
>     @     0x2ae63ad6b55b mesos::internal::master::Master::detected()
>     @     0x2ae63b9e4cfc process::ProcessBase::visit()
> 2018-03-20 10:16:45,634:1(0x2ae6765b7700):ZOO_INFO@check_events@1728: 
> initiated connection to server [<zknode1>:2181]
>     @     0x2ae63b9fac84 process::ProcessManager::resume()
>     @     0x2ae63b9fd5e6 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
>     @     0x2ae63c87ec80 (unknown)
>     @     0x2ae63d06e6ba start_thread
>     @     0x2ae63d38b3dd (unknown)
> 2018-03-20 10:16:45,651:1(0x2ae6765b7700):ZOO_INFO@check_events@1775: session 
> establishment complete on server [<zknode1>:2181], 
> sessionId=0x1623f43348692c7, negotiated timeout=10000
> I0320 10:16:45.651684    15 group.cpp:341] Group process 
> (zookeeper-group(2)@<mesos4>:5050) connected to ZooKeeper
> I0320 10:16:45.651733    15 group.cpp:831] Syncing group operations: queue 
> size (joins, cancels, datas) = (0, 0, 0)
> I0320 10:16:45.651743    15 group.cpp:419] Trying to create path '/mesos' in 
> ZooKeeper
> I0320 10:16:45.676736    15 detector.cpp:152] Detected a new leader: 
> (id='704')
> I0320 10:16:45.676844    15 group.cpp:700] Trying to get 
> '/mesos/json.info_0000000704' in ZooKeeper
> I0320 10:16:45.683346    15 zookeeper.cpp:262] A new leading master 
> (UPID=master@<mesos4>:5050) is detected
> {noformat}
> After this, the Mesos master does not respond to HTTP requests, and leader
> election does not happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
