[ 
https://issues.apache.org/jira/browse/MESOS-1318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13991348#comment-13991348
 ] 

Yan Xu commented on MESOS-1318:
-------------------------------

Here is what I think happened:

1. ZooKeeperImpl publishes itself to the ZK C library's {{zookeeper_init}} [in its 
constructor|https://github.com/apache/mesos/blob/f76ab279b55bef3e1b9b0982cbd401ac300f2a82/src/zookeeper/zookeeper.cpp#L65].
2. At this point ZooKeeperImpl::ctor has not returned and it definitely has not 
been [assigned to its wrapper class's impl 
member|https://github.com/apache/mesos/blob/f76ab279b55bef3e1b9b0982cbd401ac300f2a82/src/zookeeper/zookeeper.cpp#L375].
3. If ZK client fires an event at this time, it is processed by ProcessWatcher, 
which [calls 
getSessionId()|https://github.com/apache/mesos/blob/f76ab279b55bef3e1b9b0982cbd401ac300f2a82/src/zookeeper/watcher.hpp#L33].
4. Inside getSessionId() it tries to use [impl which is not assigned 
yet|https://github.com/apache/mesos/blob/f76ab279b55bef3e1b9b0982cbd401ac300f2a82/src/zookeeper/zookeeper.cpp#L393]!
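
The sequence above can be reproduced in miniature without ZooKeeper. This is only a sketch under stated assumptions: {{FakeWrapper}}, {{FakeImpl}} and {{implSeenDuringEvent}} are hypothetical names, not Mesos or ZooKeeper APIs, and the "event" is fired synchronously from the constructor, whereas the real one arrives on a ZK client thread. The callback only inspects whether impl has been assigned instead of dereferencing it, so the sketch shows the ordering problem without crashing.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-ins; none of these names are Mesos or ZooKeeper APIs.
struct FakeImpl;

struct FakeWrapper {
  explicit FakeWrapper(std::function<void(FakeWrapper*)> onEvent);
  ~FakeWrapper();
  FakeWrapper(const FakeWrapper&) = delete;
  FakeWrapper& operator=(const FakeWrapper&) = delete;

  // What a getSessionId()-style accessor would need before touching impl.
  bool implAssigned() const { return impl != nullptr; }

  FakeImpl* impl = nullptr;
};

struct FakeImpl {
  FakeImpl(FakeWrapper* wrapper,
           const std::function<void(FakeWrapper*)>& onEvent) {
    // Stands in for zookeeper_init: the object is published and an event
    // is delivered while this constructor is still running.
    onEvent(wrapper);
  }
};

FakeWrapper::FakeWrapper(std::function<void(FakeWrapper*)> onEvent) {
  // The assignment to impl happens only after FakeImpl's constructor
  // returns, i.e. after the event above has already been delivered.
  impl = new FakeImpl(this, onEvent);
}

FakeWrapper::~FakeWrapper() { delete impl; }

// Returns whether the event handler saw an assigned impl.
bool implSeenDuringEvent() {
  bool seen = true;
  FakeWrapper wrapper([&seen](FakeWrapper* w) { seen = w->implAssigned(); });
  return seen;
}
```

Calling {{implSeenDuringEvent()}} reports that the handler ran before impl was assigned, which is the window in which the real getSessionId() dereferences impl.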

The crux is that publishing an object before it's fully initialized is 
dangerous.
We can:
1. Add a void ZooKeeperImpl::start() which calls zookeeper_init.
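2. Have ZooKeeper ctor call {{impl->start()}}.
{code}
ZooKeeper::ZooKeeper(const string& servers,
                     const Duration& timeout,
                     Watcher* watcher)
{
  // Construct first, so that impl is assigned before any event can fire.
  impl = new ZooKeeperImpl(this, servers, timeout, watcher);
  // impl is a pointer, hence -> rather than '.'.
  impl->start();
}
{code}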

This way impl is guaranteed to be fully initialized. But nothing prevents the ZK 
client from sending us events before start() returns, that is, before the 
ZooKeeper object is fully initialized and returned to its caller. So perhaps it's 
prudent to expose the start() or init() method to the users of the ZooKeeper 
class as well.

So in Group we'd do:

{code}
void GroupProcess::initialize()
{
  // Doing initialization here allows to avoid the race between
  // instantiating the ZooKeeper instance and being spawned ourself.
  watcher = new ProcessWatcher<GroupProcess>(self());
  zk = new ZooKeeper(servers, timeout, watcher);
  zk->start();
  state = CONNECTING;
}
{code}
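
A self-contained sketch of the construct-then-start idea, again under stated assumptions: {{TwoPhaseImpl}}, {{TwoPhaseClient}} and {{implReadyWhenEventFires}} are hypothetical names, the event is delivered synchronously from start() rather than from a ZK client thread, and none of this is the actual Mesos change. It only illustrates that once registration is moved out of the constructor, the handler can no longer run before impl is assigned.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical names throughout; this is not the Mesos implementation.
class TwoPhaseImpl {
public:
  // Phase 1: store state only. Nothing is published to the "library" here.
  explicit TwoPhaseImpl(std::function<void()> onEvent)
    : onEvent_(std::move(onEvent)) {}

  // Phase 2: stands in for the call to zookeeper_init. Events may fire
  // from this point on; here one is fired immediately.
  void start() { onEvent_(); }

private:
  std::function<void()> onEvent_;
};

class TwoPhaseClient {
public:
  TwoPhaseClient()
    : impl_(new TwoPhaseImpl(
          [this] { implReadyAtEvent_ = (impl_ != nullptr); })) {}

  // Exposed to callers so that they decide when events may start arriving.
  void start() { impl_->start(); }

  bool implReadyAtEvent() const { return implReadyAtEvent_; }

private:
  std::unique_ptr<TwoPhaseImpl> impl_;
  bool implReadyAtEvent_ = false;
};

// Returns whether impl was already assigned when the first event fired.
bool implReadyWhenEventFires() {
  TwoPhaseClient client;  // fully constructed, no events yet
  client.start();         // events may fire only from here on
  return client.implReadyAtEvent();
}
```

The caller owns the moment at which events can begin, which mirrors the zk->start() call in GroupProcess::initialize() above.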

> ProcessWatcher triggers seg fault
> ---------------------------------
>
>                 Key: MESOS-1318
>                 URL: https://issues.apache.org/jira/browse/MESOS-1318
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.19.0
>            Reporter: Yan Xu
>             Fix For: 0.19.0
>
>
> Likely exposed by the fix for MESOS-1265.
> {noformat}
> 2014-05-06 18:01:18,943:17653(0x7f27f1117940):ZOO_INFO@log_env@712: Client 
> environment:zookeeper.version=zookeeper C client 3.4.5
> 2014-05-06 18:01:18,943:17653(0x7f27f1117940):ZOO_INFO@log_env@716: Client 
> environment:host.name=<redacted>
> 2014-05-06 18:01:18,943:17653(0x7f27f1117940):ZOO_INFO@log_env@723: Client 
> environment:os.name=Linux
> 2014-05-06 18:01:18,943:17653(0x7f27f1117940):ZOO_INFO@log_env@724: Client 
> environment:os.arch=<redacted>
> 2014-05-06 18:01:18,943:17653(0x7f27f1117940):ZOO_INFO@log_env@725: Client 
> environment:os.version=#1 SMP Mon Apr 7 15:24:34 PDT 2014
> 2014-05-06 18:01:18,943:17653(0x7f27f1117940):ZOO_INFO@log_env@733: Client 
> environment:user.name=(null)
> I0506 18:01:18.947623 17653 slave.cpp:244] Slave resources: cpus(*):0.01; 
> mem(*):160; disk(*):480; ports(*):[31780-31780]
> I0506 18:01:19.030995 17653 slave.cpp:272] Slave hostname: <redacted>
> I0506 18:01:19.031070 17653 slave.cpp:273] Slave checkpoint: true
> I0506 18:01:19.049667 17674 state.cpp:33] Recovering state from 
> '/var/lib/mesos/slaves/20140416-015639-1890854154-5050-1354-24096/frameworks/201103282247-0000000019-0000/executors/thermos-1399399159295-mesos-test-meta-slave-1-424-bb99b160-9bb9-4f9f-ac75-378ca9ef5957/runs/09c67d7a-11f3-4054-bcde-3256f1d17dc6/sandbox/work_3/meta'
> I0506 18:01:19.051961 17674 status_update_manager.cpp:193] Recovering status 
> update manager
> I0506 18:01:19.052105 17674 mesos_containerizer.cpp:201] Recovering 
> containerizer
> I0506 18:01:19.052505 17674 slave.cpp:2943] Finished recovery
> 2014-05-06 18:01:19,057:17653(0x7f27f1117940):ZOO_INFO@log_env@741: Client 
> environment:user.home=/home/mesos
> 2014-05-06 18:01:19,057:17653(0x7f27f1117940):ZOO_INFO@log_env@753: Client 
> environment:user.dir=/var/lib/mesos/slaves/20140416-015639-1890854154-5050-1354-24096/frameworks/201103282247-0000000019-0000/executors/thermos-1399399159295-mesos-test-meta-slave-1-424-bb99b160-9bb9-4f9f-ac75-378ca9ef5957/runs/09c67d7a-11f3-4054-bcde-3256f1d17dc6/sandbox
> 2014-05-06 18:01:19,057:17653(0x7f27f1117940):ZOO_INFO@zookeeper_init@786: 
> Initiating client connection, host=<redacted> sessionTimeout=10000 
> watcher=0x7f27f8f311f0 sessionId=0 sessionPasswd=<null> context=0x249aed0 
> flags=0
> 2014-05-06 18:01:19,352:17653(0x7f27ec4fe940):ZOO_INFO@check_events@1703: 
> initiated connection to server [10.36.79.123:2181]
> 2014-05-06 18:01:19,354:17653(0x7f27ec4fe940):ZOO_INFO@check_events@1750: 
> session establishment complete on server [10.36.79.123:2181], 
> sessionId=0x245af1f5caa8812, negotiated timeout=10000
> *** Aborted at 1399399279 (unix time) try "date -d @1399399279" if you are 
> using GNU date ***
> PC: @     0x7f27f8f2f7c0 ZooKeeper::getSessionId()
> *** SIGSEGV (@0x0) received by PID 17653 (TID 0x7f27ebcfd940) from PID 0; 
> stack trace: ***
>     @     0x7f27f8616ca0 (unknown)
>     @     0x7f27f8f2f7c0 ZooKeeper::getSessionId()
>     @     0x7f27f8f4ffcc ProcessWatcher<>::process()
>     @     0x7f27f8f31238 ZooKeeperImpl::event()
>     @     0x7f27f92292d2 deliverWatchers
>     @     0x7f27f921fe33 process_completions
>     @     0x7f27f9224bc1 do_completion
>     @     0x7f27f860e83d start_thread
>     @     0x7f27f737626d clone
> /bin/bash: line 1: 17653 Segmentation fault      (core dumped) 
> META_THERMOS_ROOT=$(pwd)/work_3 /usr/local/sbin/mesos-slave 
> --work_dir="$(pwd)/work_3" --mastsandbox/.logs/mesos-slave-3/0/stderr 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
