wangwei created SINGA-3:
---------------------------
Summary: Use Zookeeper to check stopping (finish) time of the
system
Key: SINGA-3
URL: https://issues.apache.org/jira/browse/SINGA-3
Project: Singa
Issue Type: New Feature
Environment: Linux, gcc>4.8
Reporter: wangwei
To stop a process (node), we need to stop both its local workers and its local
servers. Worker threads exit once they have finished all training steps. Server
threads can exit only after all workers connected to them have stopped.
We use Zookeeper to detect worker state. Specifically, the main thread of each
process first registers all local servers with Zookeeper. It then registers
each worker under the server group that maintains the worker's parameters. When
a worker finishes execution, it deregisters from its server group (a folder in
Zookeeper) and reports its state to the main thread. When all workers
registered under a server group have finished, the callback function registered
for that server group sends a stop message to the server. On receiving this
message, the server reports its state to the main thread and stops. Once all
local workers and local servers have finished, the main thread exits.
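The shutdown sequence described above can be sketched in a single process. This
is only an illustration under stated assumptions, not the SINGA implementation:
an in-memory GroupRegistry stands in for the Zookeeper server-group folders,
each server group is modeled as one server thread, and the names GroupRegistry,
run_process, inboxes, and events are all hypothetical.

```python
import queue
import threading


class GroupRegistry:
    """Hypothetical in-memory stand-in for the Zookeeper server-group folders."""

    def __init__(self):
        self._lock = threading.Lock()
        self._workers = {}    # server group id -> ids of workers still registered
        self._callbacks = {}  # server group id -> callable run when the group empties

    def register_server_group(self, group, on_empty):
        with self._lock:
            self._workers[group] = set()
            self._callbacks[group] = on_empty

    def register_worker(self, group, worker):
        with self._lock:
            self._workers[group].add(worker)

    def deregister_worker(self, group, worker):
        with self._lock:
            self._workers[group].discard(worker)
            group_is_empty = not self._workers[group]
        if group_is_empty:
            # The last worker of the group has left: run the group's callback.
            self._callbacks[group]()


def run_process(groups, steps=3):
    """Simulate one process; `groups` maps a server group id to its worker ids."""
    registry = GroupRegistry()
    events = queue.Queue()                        # state reports to the main thread
    inboxes = {g: queue.Queue() for g in groups}  # one message queue per server

    def server(group):
        message = inboxes[group].get()            # block until a message arrives
        if message == "stop":
            events.put(("server_done", group))

    def worker(group, worker_id):
        for _ in range(steps):
            pass                                  # placeholder for a training step
        registry.deregister_worker(group, worker_id)
        events.put(("worker_done", worker_id))

    # The main thread registers the servers first, then every worker.
    for g in groups:
        registry.register_server_group(g, lambda g=g: inboxes[g].put("stop"))
    for g, worker_ids in groups.items():
        for w in worker_ids:
            registry.register_worker(g, w)

    threads = [threading.Thread(target=server, args=(g,)) for g in groups]
    threads += [threading.Thread(target=worker, args=(g, w))
                for g, worker_ids in groups.items() for w in worker_ids]
    for t in threads:
        t.start()

    # The main thread exits only after every local worker and server has reported.
    reports = [events.get() for _ in threads]
    for t in threads:
        t.join()
    return reports


if __name__ == "__main__":
    print(run_process({"g0": ["w0", "w1"], "g1": ["w2"]}))
```

In this sketch a server group with no registered workers would never receive a
stop message, so a real implementation would need to handle that case
separately. The order of the reports is not deterministic, because a server may
report before the last worker of its group does.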
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)