[ https://issues.apache.org/jira/browse/MESOS-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393222#comment-14393222 ]

craig bordelon commented on MESOS-2451:
---------------------------------------

How do you guys want to close this?
Maybe add some doc comments to zookeeper.hpp warning against calling the
various ZooKeeper methods from within the Watcher callback.
At least that way the next person won't stumble into it.
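Something like this, for example (just a sketch of the wording; exact phrasing
and placement in zookeeper.hpp is up to whoever picks this up):

  // NOTE: Do not call ZooKeeper methods (get, set, create, remove, ...)
  // synchronously from within Watcher::process. Calling back into the
  // same client from inside the callback can deadlock the client
  // (see MESOS-2451).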
cheers


> mesos c++ zookeeper code hangs from api operation from within watcher of 
> CHANGE event
> -------------------------------------------------------------------------------------
>
>                 Key: MESOS-2451
>                 URL: https://issues.apache.org/jira/browse/MESOS-2451
>             Project: Mesos
>          Issue Type: Bug
>          Components: c++ api
>    Affects Versions: 0.22.0
>         Environment: red hat linux 6.5
>            Reporter: craig bordelon
>            Assignee: Benjamin Hindman
>         Attachments: Makefile, bug.cpp, bug0.cpp, log.h
>
>
> We've observed that the mesos 0.22.0-rc1 c++ zookeeper code appears to hang
> (two threads stuck in indefinite pthread condition waits) on a test case
> that, as best we can tell, is a mesos issue and not an issue with the
> underlying apache zookeeper C binding.
> (That is, we tried the same kind of case using the apache zookeeper C
> binding directly and saw no issues.)
> This happens with a properly running zookeeper (standalone is sufficient).
> Here's how we hung it:
> We issue a mesos zk set via
> int ZooKeeper::set(const std::string& path,
>                    const std::string& data,
>                    int version)
> then, inside a Watcher, we handle the CHANGED event by issuing a mesos zk
> get on the same path via
> int ZooKeeper::get(const std::string& path,
>                    bool watch,
>                    std::string* result,
>                    Stat* stat)
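> Roughly, the pattern we hit looks like the sketch below (the class name is
> ours, and the Watcher::process signature and include path are written from
> memory, so they may differ slightly from zookeeper.hpp; the real test case
> is in the attached bug.cpp):
>
>   #include <string>
>
>   #include "zookeeper/zookeeper.hpp"  // mesos wrapper; pulls in the C
>                                       // binding (ZOO_CHANGED_EVENT, Stat)
>
>   class BugWatcher : public Watcher
>   {
>   public:
>     virtual void process(
>         int type, int state, int64_t sessionId, const std::string& path)
>     {
>       if (type == ZOO_CHANGED_EVENT) {
>         std::string result;
>         // Calling back into the same client from inside the watcher
>         // callback; this get() never returns.
>         zk->get(path, false, &result, NULL);
>       }
>     }
>
>     ZooKeeper* zk;  // assigned right after the client is constructed
>   };
>
>   // Elsewhere: construct the ZooKeeper client with a BugWatcher, arm a
>   // watch on the znode (e.g. a get() with watch=true), then call
>   // zk->set(path, data, -1). When the CHANGED event arrives, both the
>   // pending set() and the watcher's get() block forever, giving the two
>   // stacks below.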
> we end up with two threads in the process, both stuck in pthread_cond_wait:
> #0  0x000000334e20b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/libpthread.so.0
> #1  0x00007f6664ee1cf5 in Gate::arrive (this=0x7f6140, old=0)
>     at ../../../3rdparty/libprocess/src/gate.hpp:82
> #2  0x00007f6664ecef6e in process::ProcessManager::wait (this=0x7f02e0, 
> pid=...)
>     at ../../../3rdparty/libprocess/src/process.cpp:2476
> #3  0x00007f6664ed2ce9 in process::wait (pid=..., duration=...)
>     at ../../../3rdparty/libprocess/src/process.cpp:2958
> #4  0x00007f6664e90558 in process::Latch::await (this=0x7f6ba0, duration=...)
>     at ../../../3rdparty/libprocess/src/latch.cpp:49
> #5  0x00007f66649452cc in process::Future<int>::await (this=0x7fffa0fd9040, 
> duration=...)
>     at ../../3rdparty/libprocess/include/process/future.hpp:1156
> #6  0x00007f666493a04d in process::Future<int>::get (this=0x7fffa0fd9040)
>     at ../../3rdparty/libprocess/include/process/future.hpp:1167
> #7  0x00007f6664ab1aac in ZooKeeper::set (this=0x803ce0, path="/craig/mo", 
> data=
> ...
> and
> #0  0x000000334e20b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/libpthread.so.0
> #1  0x00007f6664ee1cf5 in Gate::arrive (this=0x7f66380013f0, old=0)
>     at ../../../3rdparty/libprocess/src/gate.hpp:82
> #2  0x00007f6664ecef6e in process::ProcessManager::wait (this=0x7f02e0, 
> pid=...)
>     at ../../../3rdparty/libprocess/src/process.cpp:2476
> #3  0x00007f6664ed2ce9 in process::wait (pid=..., duration=...)
>     at ../../../3rdparty/libprocess/src/process.cpp:2958
> #4  0x00007f6664e90558 in process::Latch::await (this=0x7f6638000d00, 
> duration=...)
>     at ../../../3rdparty/libprocess/src/latch.cpp:49
> #5  0x00007f66649452cc in process::Future<int>::await (this=0x7f66595fb6f0, 
> duration=...)
>     at ../../3rdparty/libprocess/include/process/future.hpp:1156
> #6  0x00007f666493a04d in process::Future<int>::get (this=0x7f66595fb6f0)
>     at ../../3rdparty/libprocess/include/process/future.hpp:1167
> #7  0x00007f6664ab18d3 in ZooKeeper::get (this=0x803ce0, path="/craig/mo", 
> watch=false,
> ....
> We of course have a separate "enhancement" suggestion that the mesos C++
> zookeeper api use timed waits rather than blocking indefinitely for
> responses.
> But in this case we think the mesos code itself is blocking on itself and
> never gets to handle the responses.
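> For example, in the wrapper, something along these lines instead of an
> unconditional future.get() (a sketch only; await() is the libprocess call
> already visible in the stacks above, while Seconds(10) and the ZSYSTEMERROR
> fallback are just illustrative choices):
>
>   Future<int> result = ...;           // dispatched to the ZooKeeperProcess
>   if (!result.await(Seconds(10))) {   // bounded wait instead of blocking get()
>     return ZSYSTEMERROR;              // or whatever error code the api prefers
>   }
>   return result.get();                // the future is no longer pending here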
> craig



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
