huijunw commented on issue #2907: stuck stmgr due to zk client destructor
URL: 
https://github.com/apache/incubator-heron/issues/2907#issuecomment-391510536
 
 
   For thread 0x7f616cc62700:
   It invoked GetCompletionWatcher() when a get-zk-node operation 
completed. Inside the watcher the call chain is ZkActionCb -> ExecuteInEventLoop 
-> enqueue -> notify_one -> __lll_lock_wait(), where it is stuck.
   
   For the main thread 0x7f616ecae780:
   A zk session expired, so GlobalWatchEventHandler was called -> ~ZKClient() -> 
first delete piper_, then zookeeper_close() -> join thread, where it is stuck.
   
   Our theory is: two events (session_expire and get_zk_node) happened, and each 
was handled in a different thread. The main thread handled session_expire, while 
the other thread handled get_zk_node_done. The session_expire watcher in main 
waits for the other thread to join, while the other thread waits for a lock that 
was already destroyed along with piper_ by ~ZKClient() in the main thread.
   
   If multi-threading is intended, the proposed solution is to swap the order of 
these two calls in ~ZKClient(), so that zookeeper_close() joins its threads 
while piper_ is still alive:
   ```
     delete piper_;
     zookeeper_close(zk_handle_);
   ```
   
https://github.com/apache/incubator-heron/blob/0.17.8/heron/common/src/cpp/zookeeper/zkclient.cpp#L146
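
The ordering argument can be sketched in isolation. This is only a minimal illustration under stated assumptions, not Heron's code: MiniPiper and MiniClient are hypothetical stand-ins for Piper and ZKClient, and std::thread::join() stands in for the thread join that happens inside zookeeper_close(). The destructor joins the completion thread first and only then deletes the piper, so the mutex used by Enqueue() is still valid for as long as any thread can reach it.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical stand-in for Piper: a mutex-protected queue of callbacks.
class MiniPiper {
 public:
  void Enqueue(std::function<void()> cb) {
    {
      std::lock_guard<std::mutex> guard(mutex_);
      queue_.push(std::move(cb));
    }
    cond_.notify_one();  // mirrors the notify_one seen in the stack trace
  }

 private:
  std::mutex mutex_;
  std::condition_variable cond_;
  std::queue<std::function<void()>> queue_;
};

// Hypothetical stand-in for ZKClient. The worker thread plays the role of
// the ZooKeeper completion thread that calls back into the piper.
class MiniClient {
 public:
  MiniClient(int completions, std::atomic<int>* enqueued)
      : piper_(new MiniPiper()),
        completion_thread_([this, completions, enqueued] {
          for (int i = 0; i < completions; ++i) {
            piper_->Enqueue([] {});
            enqueued->fetch_add(1);
          }
        }) {}

  ~MiniClient() {
    // Join first (stand-in for zookeeper_close(zk_handle_)): after this
    // returns, no other thread can touch piper_ any more.
    completion_thread_.join();
    // Only now is it safe to destroy the piper and the mutex inside it.
    delete piper_;
  }

 private:
  MiniPiper* piper_;               // declared first, so initialized first
  std::thread completion_thread_;  // started only after piper_ exists
};

// Builds and destroys a client, returning how many enqueues completed.
int RunDemo() {
  std::atomic<int> enqueued{0};
  {
    MiniClient client(3, &enqueued);
  }  // ~MiniClient(): join, then delete
  return enqueued.load();
}
```

With the opposite order (delete first, join second), the worker could still be inside Enqueue() while the mutex is being destroyed, which is undefined behavior and matches the hang described above.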
   

