[ 
https://issues.apache.org/jira/browse/MESOS-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14351355#comment-14351355
 ] 

craig bordelon commented on MESOS-2451:
---------------------------------------

So I think you're really asking me/us to pare the test case down further and 
submit it as source code, because otherwise my description, while brief, is 
fairly self-contained.
It will take me a little while to do that, and given that it's Friday night 
in the eastern US, probably not today :)
But I will give it a try.
In the meantime, if there are known issues with calling Mesos C++ ZooKeeper 
API functions from within the Watcher's process method (invoked in response 
to earlier ZooKeeper API calls from the same process, whether through Mesos 
or the ZooKeeper C binding), that would help.
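
To make the pattern concrete, here is roughly its shape. This is a sketch 
only, not our actual test case: the set()/get() signatures are the ones 
quoted in the issue description below, but the Watcher::process signature, 
the ZooKeeper constructor, the include paths, and the "localhost:2181" / 
"/craig/mo" values are from memory or placeholders and may not match 
0.22.0-rc1 exactly.

#include <string>

#include <zookeeper.h>                // C binding: ZOO_CHANGED_EVENT, Stat

#include <stout/duration.hpp>         // Seconds
#include "zookeeper/zookeeper.hpp"    // the Mesos C++ wrapper

class GetOnChangeWatcher : public Watcher
{
public:
  explicit GetOnChangeWatcher(ZooKeeper** zk) : zk(zk) {}

  virtual void process(
      int type,
      int state,
      int64_t sessionId,
      const std::string& path)
  {
    if (type == ZOO_CHANGED_EVENT) {
      // Calling back into the same connection from inside the watcher:
      // this is the get() that never returns for us.
      std::string result;
      Stat stat;
      (*zk)->get(path, true, &result, &stat);
    }
  }

private:
  ZooKeeper** zk;
};

int main()
{
  ZooKeeper* zk = NULL;
  GetOnChangeWatcher watcher(&zk);
  zk = new ZooKeeper("localhost:2181", Seconds(10), &watcher);

  std::string dummy;
  Stat stat;
  zk->get("/craig/mo", true, &dummy, &stat);  // arm the CHANGED watch
                                              // (znode assumed to exist)
  zk->set("/craig/mo", "new-data", -1);       // fires the watcher; this set()
                                              // and the watcher's get() both
                                              // park in pthread_cond_wait
  return 0;
}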

For example:
Over the last couple of days I have tried, without success, to find a 
work-around.
For instance, I tried opening a second connection to ZooKeeper via Mesos and 
performing the Mesos create/set on the path from that connection, while the 
original connection maintained the watch and performed the Mesos get on the 
same path from within the Watcher's process method. It still hung.
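
Concretely, that two-connection attempt looked roughly like this (same 
caveats and same includes as the sketch above, and it reuses 
GetOnChangeWatcher from there; NoopWatcher is just a hypothetical do-nothing 
watcher for the second connection):

class NoopWatcher : public Watcher
{
public:
  virtual void process(int, int, int64_t, const std::string&) {}
};

int main()
{
  // Connection 1 holds the watch; its watcher calls get() on zk1, as above.
  ZooKeeper* zk1 = NULL;
  GetOnChangeWatcher watcher(&zk1);
  zk1 = new ZooKeeper("localhost:2181", Seconds(10), &watcher);

  // Connection 2 is used only to trigger the CHANGED event.
  NoopWatcher noop;
  ZooKeeper zk2("localhost:2181", Seconds(10), &noop);

  std::string dummy;
  Stat stat;
  zk1->get("/craig/mo", true, &dummy, &stat);  // arm the watch on connection 1
  zk2.set("/craig/mo", "new-data", -1);        // set from connection 2
  // -> connection 1's watcher fires and its get() on zk1 still hangs.
  return 0;
}
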
I then tried a bit of a hack: obtain the second connection's underlying 
Apache ZooKeeper zhandle_t* and issue the asynchronous aget via the ZooKeeper 
C binding API rather than the Mesos C++ get API. This reaches a different 
hung state when we use our own pthread condition variable to wait for the get 
response (signaled from our completion function). Only when we time out that 
wait do we afterward see the completion function finally get called and try 
to notify_all on the condition variable, which is no longer being waited on.
So we have been unable to work around the issue as of now.
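
For reference, here is roughly what that second hack looked like. Again a 
sketch: how we dug the zhandle_t* out of the Mesos wrapper is omitted, and 
the timeout value is arbitrary; the zoo_aget()/data_completion_t signatures 
are the standard ZooKeeper C binding ones.

#include <pthread.h>
#include <time.h>
#include <string>

#include <zookeeper.h>

struct GetResult
{
  pthread_mutex_t mutex;
  pthread_cond_t cond;
  bool done;
  int rc;
  std::string data;
};

static void getCompletion(
    int rc, const char* value, int value_len,
    const struct Stat* stat, const void* ctx)
{
  GetResult* result = (GetResult*) ctx;
  pthread_mutex_lock(&result->mutex);
  result->rc = rc;
  if (rc == ZOK && value != NULL) {
    result->data.assign(value, value_len);
  }
  result->done = true;
  // In our runs this broadcast only happened after the waiter below had
  // already timed out and moved on.
  pthread_cond_broadcast(&result->cond);
  pthread_mutex_unlock(&result->mutex);
}

// Called from inside the Watcher; 'zh' is the second connection's zhandle_t*.
static int getWithTimeout(zhandle_t* zh, const char* path, int timeoutSecs)
{
  GetResult* result = new GetResult();
  pthread_mutex_init(&result->mutex, NULL);
  pthread_cond_init(&result->cond, NULL);
  result->done = false;
  result->rc = 0;

  zoo_aget(zh, path, 0, getCompletion, result);

  struct timespec deadline;
  clock_gettime(CLOCK_REALTIME, &deadline);
  deadline.tv_sec += timeoutSecs;

  pthread_mutex_lock(&result->mutex);
  int waitrc = 0;
  while (!result->done && waitrc == 0) {
    waitrc = pthread_cond_timedwait(&result->cond, &result->mutex, &deadline);
  }
  bool done = result->done;
  int rc = result->rc;
  pthread_mutex_unlock(&result->mutex);

  if (done) {
    delete result;
    return rc;
  }

  // Timed out (ETIMEDOUT). We deliberately leak 'result' here rather than
  // hand the late-arriving completion callback freed memory.
  return waitrc;
}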


> mesos c++ zookeeper code hangs from api operation from within watcher of 
> CHANGE event
> -------------------------------------------------------------------------------------
>
>                 Key: MESOS-2451
>                 URL: https://issues.apache.org/jira/browse/MESOS-2451
>             Project: Mesos
>          Issue Type: Bug
>          Components: c++ api
>    Affects Versions: 0.22.0
>         Environment: red hat linux 6.5
>            Reporter: craig bordelon
>            Assignee: Benjamin Hindman
>
> We've observed that the Mesos 0.22.0-rc1 C++ ZooKeeper code appears to hang 
> (two threads stuck in indefinite pthread condition waits) on a test case 
> that, as best we can tell, is a Mesos issue and not an issue with the 
> underlying Apache ZooKeeper C binding.
> (That is, we tried the same kind of case using the Apache ZooKeeper C 
> binding directly and saw no issues.)
> This happens against a properly running ZooKeeper (standalone is sufficient).
> Here's how we hung it:
> We issue a Mesos zk set via
>     int ZooKeeper::set(const std::string& path,
>                        const std::string& data,
>                        int version)
> then, inside a Watcher, we handle the CHANGED event by issuing a Mesos zk 
> get on the same path via
>     int ZooKeeper::get(const std::string& path,
>                        bool watch,
>                        std::string* result,
>                        Stat* stat)
> We end up with two threads in the process, both in pthread_cond_wait:
> #0  0x000000334e20b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/libpthread.so.0
> #1  0x00007f6664ee1cf5 in Gate::arrive (this=0x7f6140, old=0)
>     at ../../../3rdparty/libprocess/src/gate.hpp:82
> #2  0x00007f6664ecef6e in process::ProcessManager::wait (this=0x7f02e0, 
> pid=...)
>     at ../../../3rdparty/libprocess/src/process.cpp:2476
> #3  0x00007f6664ed2ce9 in process::wait (pid=..., duration=...)
>     at ../../../3rdparty/libprocess/src/process.cpp:2958
> #4  0x00007f6664e90558 in process::Latch::await (this=0x7f6ba0, duration=...)
>     at ../../../3rdparty/libprocess/src/latch.cpp:49
> #5  0x00007f66649452cc in process::Future<int>::await (this=0x7fffa0fd9040, 
> duration=...)
>     at ../../3rdparty/libprocess/include/process/future.hpp:1156
> #6  0x00007f666493a04d in process::Future<int>::get (this=0x7fffa0fd9040)
>     at ../../3rdparty/libprocess/include/process/future.hpp:1167
> #7  0x00007f6664ab1aac in ZooKeeper::set (this=0x803ce0, path="/craig/mo", 
> data=
> ...
> and
> #0  0x000000334e20b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/libpthread.so.0
> #1  0x00007f6664ee1cf5 in Gate::arrive (this=0x7f66380013f0, old=0)
>     at ../../../3rdparty/libprocess/src/gate.hpp:82
> #2  0x00007f6664ecef6e in process::ProcessManager::wait (this=0x7f02e0, 
> pid=...)
>     at ../../../3rdparty/libprocess/src/process.cpp:2476
> #3  0x00007f6664ed2ce9 in process::wait (pid=..., duration=...)
>     at ../../../3rdparty/libprocess/src/process.cpp:2958
> #4  0x00007f6664e90558 in process::Latch::await (this=0x7f6638000d00, 
> duration=...)
>     at ../../../3rdparty/libprocess/src/latch.cpp:49
> #5  0x00007f66649452cc in process::Future<int>::await (this=0x7f66595fb6f0, 
> duration=...)
>     at ../../3rdparty/libprocess/include/process/future.hpp:1156
> #6  0x00007f666493a04d in process::Future<int>::get (this=0x7f66595fb6f0)
>     at ../../3rdparty/libprocess/include/process/future.hpp:1167
> #7  0x00007f6664ab18d3 in ZooKeeper::get (this=0x803ce0, path="/craig/mo", 
> watch=false,
> ....
> We do of course have a separate "enhancement" suggestion that the Mesos C++ 
> ZooKeeper API use timed waits rather than blocking indefinitely for 
> responses.
> But in this case we think the Mesos code itself is blocking on itself and 
> not handling the responses.
> craig


