huijunw commented on issue #2907: stuck stmgr due to zk client destructor
URL: https://github.com/apache/incubator-heron/issues/2907#issuecomment-391510536

Thread 0x7f616cc62700: it called GetCompletionWatcher() when a get-zk-node operation completed. In the watcher, the call chain ZkActionCb -> ExecuteInEventLoop -> enqueue -> notify_one -> __lll_lock_wait() got stuck.

Main thread 0x7f616ecae780: a zk session expired and GlobalWatchEventHandler was called -> ~ZKClient() -> first delete piper, then zookeeper_close() -> join thread, stuck.

Our theory: two events (session_expire and get_zk_node) happened, and each was handled in a different thread. The main thread handled session_expire, while the other thread handled get_zk_node_done. The session_expire watcher in the main thread waits for the other thread to join, while the other thread waits for a lock that was already deleted by ~ZKClient() in the main thread.

If multithreading is intended, the proposed solution is to reorder the delete_piper and close_zk_client steps, which currently run in this order:

```
delete piper_;
zookeeper_close(zk_handle_);
```

https://github.com/apache/incubator-heron/blob/0.17.8/heron/common/src/cpp/zookeeper/zkclient.cpp#L146
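To illustrate the proposed ordering, here is a minimal, self-contained sketch. It does not use the real Heron or ZooKeeper code: `Piper`, `Client`, and `RunSafeShutdown` are hypothetical stand-ins, and the `join()` call only models the thread join that zookeeper_close() is described as performing above. The point is that the worker thread is joined first, and the mutex and condition variable it uses are destroyed only afterwards.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical stand-in for the piper: a callback queue guarded by a mutex.
class Piper {
 public:
  void Enqueue(std::function<void()> cb) {
    {
      std::lock_guard<std::mutex> lock(mu_);  // would hang if mu_ were already destroyed
      queue_.push(std::move(cb));
    }
    cv_.notify_one();
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> queue_;
};

// Hypothetical stand-in for the zk client: owns a completion thread
// that posts callbacks into the piper.
class Client {
 public:
  Client(int completions, std::atomic<int>& enqueued)
      : piper_(new Piper),
        completion_thread_([this, completions, &enqueued] {
          for (int i = 0; i < completions; ++i) {
            piper_->Enqueue([] {});
            ++enqueued;
          }
        }) {}

  Client(const Client&) = delete;
  Client& operator=(const Client&) = delete;

  ~Client() {
    // Step 1: stop the thread that may still touch piper_
    // (models the join attributed to zookeeper_close() in the comment).
    assert(completion_thread_.joinable());
    completion_thread_.join();
    // Step 2: only now free the piper; no other thread can reach it.
    delete piper_;
  }

 private:
  Piper* piper_;                   // declared first so it exists before the thread starts
  std::thread completion_thread_;
};

// Creates and destroys a client, returning how many callbacks were enqueued.
int RunSafeShutdown(int completions) {
  std::atomic<int> enqueued{0};
  {
    Client client(completions, enqueued);
  }  // destructor runs here: join first, then delete
  return enqueued.load();
}
```

With the two destructor steps swapped, the completion thread could call `Enqueue` on a freed `Piper`, which is undefined behavior and matches the hang on the lock described in the theory.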
