craig bordelon created MESOS-2451:
-------------------------------------

             Summary: mesos c++ zookeeper code hangs from api operation from within watcher of CHANGE event
                 Key: MESOS-2451
                 URL: https://issues.apache.org/jira/browse/MESOS-2451
             Project: Mesos
          Issue Type: Bug
          Components: c++ api
    Affects Versions: 0.22.0
         Environment: red hat linux 6.5
            Reporter: craig bordelon


We've observed that the Mesos 0.22.0-rc1 C++ ZooKeeper code appears to hang 
(two threads stuck in indefinite pthread condition waits) on a test case 
that, as best we can tell, is a Mesos issue and not an issue with the 
underlying Apache ZooKeeper C binding.
(That is, we tried the same type of case using the Apache ZooKeeper C binding 
directly and saw no issues.)
This happens with a properly running ZooKeeper (standalone is sufficient).

Here's how we hung it:
We issue a Mesos zk set via

int ZooKeeper::set(const std::string& path, const std::string& data, int version)

Then, inside a Watcher, we handle the CHANGED event by issuing a Mesos zk get 
on the same path via

int ZooKeeper::get(const std::string& path, bool watch, std::string* result, Stat* stat)
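For reference, a sketch of the watcher pattern that triggers the hang. This is 
not a standalone program; the Watcher::process signature and the ZooKeeper 
constructor shape are our reading of Mesos' src/zookeeper/zookeeper.hpp, and 
ZOO_CHANGED_EVENT comes from the ZooKeeper C headers:

class GetOnChangeWatcher : public Watcher
{
public:
  explicit GetOnChangeWatcher(ZooKeeper** zk) : zk(zk) {}

  virtual void process(
      int type, int state, int64_t sessionId, const std::string& path)
  {
    if (type == ZOO_CHANGED_EVENT) {
      std::string result;
      // Blocking get() issued from inside the watcher callback:
      // this call never returns (the hang reported above).
      (*zk)->get(path, false, &result, NULL);
    }
  }

private:
  ZooKeeper** zk;
};

// Elsewhere (sketch):
//   GetOnChangeWatcher watcher(&zk);
//   zk = new ZooKeeper(servers, Seconds(10), &watcher);
//   zk->get("/craig/mo", true, &value, NULL);   // sets the watch
//   zk->set("/craig/mo", "data", -1);           // fires CHANGED -> hang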

We end up with two threads in the process, both in pthread_cond_wait:
#0  0x000000334e20b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib64/libpthread.so.0
#1  0x00007f6664ee1cf5 in Gate::arrive (this=0x7f6140, old=0)
    at ../../../3rdparty/libprocess/src/gate.hpp:82
#2  0x00007f6664ecef6e in process::ProcessManager::wait (this=0x7f02e0, pid=...)
    at ../../../3rdparty/libprocess/src/process.cpp:2476
#3  0x00007f6664ed2ce9 in process::wait (pid=..., duration=...)
    at ../../../3rdparty/libprocess/src/process.cpp:2958
#4  0x00007f6664e90558 in process::Latch::await (this=0x7f6ba0, duration=...)
    at ../../../3rdparty/libprocess/src/latch.cpp:49
#5  0x00007f66649452cc in process::Future<int>::await (this=0x7fffa0fd9040, 
duration=...)
    at ../../3rdparty/libprocess/include/process/future.hpp:1156
#6  0x00007f666493a04d in process::Future<int>::get (this=0x7fffa0fd9040)
    at ../../3rdparty/libprocess/include/process/future.hpp:1167
#7  0x00007f6664ab1aac in ZooKeeper::set (this=0x803ce0, path="/craig/mo", data=
...

and
#0  0x000000334e20b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib64/libpthread.so.0
#1  0x00007f6664ee1cf5 in Gate::arrive (this=0x7f66380013f0, old=0)
    at ../../../3rdparty/libprocess/src/gate.hpp:82
#2  0x00007f6664ecef6e in process::ProcessManager::wait (this=0x7f02e0, pid=...)
    at ../../../3rdparty/libprocess/src/process.cpp:2476
#3  0x00007f6664ed2ce9 in process::wait (pid=..., duration=...)
    at ../../../3rdparty/libprocess/src/process.cpp:2958
#4  0x00007f6664e90558 in process::Latch::await (this=0x7f6638000d00, 
duration=...)
    at ../../../3rdparty/libprocess/src/latch.cpp:49
#5  0x00007f66649452cc in process::Future<int>::await (this=0x7f66595fb6f0, 
duration=...)
    at ../../3rdparty/libprocess/include/process/future.hpp:1156
#6  0x00007f666493a04d in process::Future<int>::get (this=0x7f66595fb6f0)
    at ../../3rdparty/libprocess/include/process/future.hpp:1167
#7  0x00007f6664ab18d3 in ZooKeeper::get (this=0x803ce0, path="/craig/mo", 
watch=false,
....

We do of course have a separate "enhancement" suggestion that the Mesos C++ 
ZooKeeper API use timed waits rather than blocking indefinitely for responses.
But in this case we believe the Mesos code itself is blocking on itself and 
not handling the responses.

craig



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
