[ 
https://issues.apache.org/jira/browse/MESOS-3595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945847#comment-14945847
 ] 

Benjamin Mahler commented on MESOS-3595:
----------------------------------------

The ZooKeeper interface is currently blocking for historical reasons; it should 
be asynchronous (see 
[here|https://issues.apache.org/jira/browse/MESOS-2451?focusedCommentId=14351288&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14351288]).
I would rather see that fixed than add the workaround you proposed, since this 
issue is present for local runs only. By the way, it would be great to clarify 
in the ticket that this is only relevant for local testing, when the frameworks 
are embedded in the same process as the master; otherwise this would be a 
critical bug.
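
To make the difference between the blocking and asynchronous styles concrete, here is a minimal sketch in Python. It does not use Mesos, libprocess, or ZooKeeper; `POOL_SIZE`, `run_handlers`, `handler`, and `reply` are hypothetical names, and a `ThreadPoolExecutor` merely stands in for the libprocess worker threads. Each handler needs a reply that is produced by another task on the same pool. The barrier is only there to make the run deterministic, and the sketch models at most `POOL_SIZE` handlers.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor, TimeoutError

POOL_SIZE = 4  # hypothetical stand-in for the libprocess worker thread count


def run_handlers(num_handlers, blocking, timeout=0.5):
    """Run handlers on a fixed pool; each one needs a reply that is itself
    produced by a task on the same pool. Returns one outcome per handler,
    or None where a blocking wait timed out."""
    if num_handlers > POOL_SIZE:
        raise ValueError("sketch only models up to POOL_SIZE handlers")
    pool = ThreadPoolExecutor(max_workers=POOL_SIZE)
    barrier = threading.Barrier(num_handlers)

    def reply():
        return "children"

    def handler():
        barrier.wait()                # every handler now holds a worker
        pending = pool.submit(reply)  # the reply needs a worker as well
        if not blocking:
            return pending            # asynchronous: free the worker at once
        try:
            outcome = pending.result(timeout=timeout)  # blocks this worker
        except TimeoutError:
            outcome = None            # no worker was free to run reply()
        barrier.wait()                # hold the worker until all have an outcome
        return outcome

    handlers = [pool.submit(handler) for _ in range(num_handlers)]
    outcomes = []
    for h in handlers:
        value = h.result()
        if isinstance(value, Future):
            value = value.result(timeout=5)  # waited on outside the pool
        outcomes.append(value)
    pool.shutdown(wait=True)
    return outcomes


if __name__ == "__main__":
    print(run_handlers(POOL_SIZE, blocking=True))
    print(run_handlers(POOL_SIZE - 1, blocking=True))
    print(run_handlers(POOL_SIZE, blocking=False))
```

With every worker held by a blocking handler, no thread is left to produce the replies, so each wait times out (a real blocking call without a timeout would hang instead). Leaving one worker free, or returning the pending future instead of waiting on it, lets every handler get its reply.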

> Framework process hangs after master failover when number frameworks > 
> libprocess thread pool size
> --------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-3595
>                 URL: https://issues.apache.org/jira/browse/MESOS-3595
>             Project: Mesos
>          Issue Type: Bug
>          Components: scheduler driver
>    Affects Versions: 0.24.1
>            Reporter: Mandeep Chadha
>            Assignee: Mandeep Chadha
>
> If the number of frameworks created exceeds the number of libprocess threads, 
> then during master failover the ZooKeeper updates can
> cause a deadlock. E.g., on a machine with 24 CPUs, if the framework count 
> exceeds 24, then when the master fails over all the libprocess threads block 
> updating the cache (GroupProcess), leading to deadlock. Below is the stack 
> trace of one of the libprocess threads:
> {code}
> Thread 101 (Thread 0x7f42821f1700 (LWP 5974)):
> #0  0x000000314100b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1  0x00007f42870d1637 in Gate::arrive(long) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #2  0x00007f42870be87c in process::ProcessManager::wait(process::UPID const&) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #3  0x00007f42870c25f7 in process::wait(process::UPID const&, Duration const&) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #4  0x00007f428708e294 in process::Latch::await(Duration const&) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #5  0x00007f4286b67dea in process::Future<int>::await(Duration const&) const () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #6  0x00007f4286b5a0df in process::Future<int>::get() const () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #7  0x00007f4286ff0508 in ZooKeeper::getChildren(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > >*) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #8  0x00007f4286cb394e in zookeeper::GroupProcess::cache() () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #9  0x00007f4286cb1e63 in zookeeper::GroupProcess::updated(long, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #10 0x00007f4286ce027a in std::tr1::_Mem_fn<void (zookeeper::GroupProcess::*)(long, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)>::operator()(zookeeper::GroupProcess*, long, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #11 0x00007f4286ce0067 in std::tr1::result_of<std::tr1::_Mem_fn<void (zookeeper::GroupProcess::*)(long, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> ()(std::tr1::result_of<std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<zookeeper::GroupProcess*&>)>::type, std::tr1::result_of<std::tr1::_Mu<long, false, false> ()(long, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<zookeeper::GroupProcess*&>))>::type, std::tr1::result_of<std::tr1::_Mu<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, false, false> ()(std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<zookeeper::GroupProcess*&>))>::type)>::type std::tr1::_Bind<std::tr1::_Mem_fn<void (zookeeper::GroupProcess::*)(long, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> ()(std::tr1::_Placeholder<1>, long, std::basic_string<char, std::char_traits<char>, std::allocator<char> >)>::__call<zookeeper::GroupProcess*&, 0, 1, 2>(std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ( const&)(std::tr1::_Placeholder<1>, std::tr1::tuple<zookeeper::GroupProcess*&>), std::tr1::_Index_tuple<0, 1, 2>) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #12 0x00007f4286cdfd16 in std::tr1::result_of<std::tr1::_Mem_fn<void (zookeeper::GroupProcess::*)(long, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> ()(std::tr1::result_of<std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<zookeeper::GroupProcess*>)>::type, std::tr1::result_of<std::tr1::_Mu<long, false, false> ()(long, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<zookeeper::GroupProcess*>))>::type, std::tr1::result_of<std::tr1::_Mu<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, false, false> ()(std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<zookeeper::GroupProcess*>))>::type)>::type std::tr1::_Bind<std::tr1::_Mem_fn<void (zookeeper::GroupProcess::*)(long, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> ()(std::tr1::_Placeholder<1>, long, std::basic_string<char, std::char_traits<char>, std::allocator<char> >)>::operator()<zookeeper::GroupProcess*>(zookeeper::GroupProcess*&) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #13 0x00007f4286cdf8be in std::tr1::_Function_handler<void ()(zookeeper::GroupProcess*), std::tr1::_Bind<std::tr1::_Mem_fn<void (zookeeper::GroupProcess::*)(long, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> ()(std::tr1::_Placeholder<1>, long, std::basic_string<char, std::char_traits<char>, std::allocator<char> >)> >::_M_invoke(std::tr1::_Any_data const&, zookeeper::GroupProcess*) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #14 0x00007f4286cc2394 in std::tr1::function<void ()(zookeeper::GroupProcess*)>::operator()(zookeeper::GroupProcess*) const () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #15 0x00007f4286cbc3a2 in void process::internal::vdispatcher<zookeeper::GroupProcess>(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #16 0x00007f4286ccdca5 in std::tr1::result_of<void (*()(std::tr1::result_of<std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<process::ProcessBase*&>)>::type, std::tr1::result_of<std::tr1::_Mu<std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >, false, false> ()(std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<process::ProcessBase*&>))>::type))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >)>::type std::tr1::_Bind<void (*()(std::tr1::_Placeholder<1>, std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >)>::__call<process::ProcessBase*&, 0, 1>(std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ( const&)(std::tr1::_Placeholder<1>, std::tr1::tuple<process::ProcessBase*&>), std::tr1::_Index_tuple<0, 1>) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #17 0x00007f4286cc7a5a in std::tr1::result_of<void (*()(std::tr1::result_of<std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<process::ProcessBase*>)>::type, std::tr1::result_of<std::tr1::_Mu<std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >, false, false> ()(std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<process::ProcessBase*>))>::type))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >)>::type std::tr1::_Bind<void (*()(std::tr1::_Placeholder<1>, std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >)>::operator()<process::ProcessBase*>(process::ProcessBase*&) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #18 0x00007f4286cc2480 in std::tr1::_Function_handler<void ()(process::ProcessBase*), std::tr1::_Bind<void (*()(std::tr1::_Placeholder<1>, std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >)> >::_M_invoke(std::tr1::_Any_data const&, process::ProcessBase*) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #19 0x00007f42870db546 in std::tr1::function<void ()(process::ProcessBase*)>::operator()(process::ProcessBase*) const () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #20 0x00007f42870c1013 in process::ProcessBase::visit(process::DispatchEvent const&) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #21 0x00007f42870c5582 in process::DispatchEvent::visit(process::EventVisitor*) const () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #22 0x00007f428666680e in process::ProcessBase::serve(process::Event const&) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #23 0x00007f42870bd88f in process::ProcessManager::resume(process::ProcessBase*) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #24 0x00007f42870b1cb9 in process::schedule(void*) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #25 0x00000031410079d1 in start_thread () from /lib64/libpthread.so.0
> #26 0x00000031408e88fd in clone () from /lib64/libc.so.6
> {code}
> Solution: 
> Create one master detector per URL instead of one per framework.
> I will send a review request.
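
The "one detector per URL" idea in the reported solution can be sketched as a small registry keyed by ZooKeeper URL. This is only an illustration in Python, not the Mesos scheduler driver's actual code: `MasterDetector`, `DetectorRegistry`, `detector_for`, and the example URL are all hypothetical names chosen for the sketch.

```python
import threading


class MasterDetector:
    """Hypothetical stand-in for a ZooKeeper-backed master detector."""

    def __init__(self, url):
        self.url = url


class DetectorRegistry:
    """Hands out one shared detector per ZooKeeper URL."""

    def __init__(self):
        self._lock = threading.Lock()
        self._by_url = {}

    def detector_for(self, url):
        # Create the detector on first use; later callers share it.
        with self._lock:
            detector = self._by_url.get(url)
            if detector is None:
                detector = MasterDetector(url)
                self._by_url[url] = detector
            return detector

    def __len__(self):
        return len(self._by_url)


if __name__ == "__main__":
    registry = DetectorRegistry()
    url = "zk://example:2181/mesos"  # made-up URL for illustration
    detectors = [registry.detector_for(url) for _ in range(30)]
    print(len(registry), all(d is detectors[0] for d in detectors))
```

Under this scheme the number of detectors depends on the number of distinct URLs rather than on the number of frameworks embedded in the process.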



