[
https://issues.apache.org/jira/browse/SINGA-3?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562217#comment-14562217
]
Sheng Wang commented on SINGA-3:
--------------------------------
Pull Request #3 has been merged to add this feature.
> Use Zookeeper to check stopping (finish) time of the system
> -----------------------------------------------------------
>
> Key: SINGA-3
> URL: https://issues.apache.org/jira/browse/SINGA-3
> Project: Singa
> Issue Type: New Feature
> Environment: Linux, gcc>4.8
> Reporter: wangwei
>
> To stop a process (node), we need to stop both its local workers and its
> local servers. Worker threads exit when they finish all training steps.
> Server threads can exit only when all connected workers have stopped.
> We use Zookeeper to detect the worker state. Specifically, the main thread
> of each process first registers all local servers with Zookeeper. Then it
> registers each worker with a dedicated server group, where that worker's
> parameters are maintained. When a worker finishes execution, it de-registers
> from the server group (folder) in Zookeeper and tells the main thread about
> its state. When all workers registered in a server group have finished, the
> callback function registered for that server group sends a stop message to
> the server. The server tells the main thread about its state and stops upon
> receiving this message. Once all local workers and local servers have
> finished, the main thread exits.
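The shutdown sequence described above can be sketched roughly as follows. This is a minimal in-process simulation, not SINGA's implementation and not the Zookeeper client API: `GroupRegistry` is a hypothetical in-memory stand-in for the server-group folders, its `on_empty` callback plays the role of the Zookeeper watch, and `run_process`, `inboxes`, and `events` are names invented for illustration. It assumes every server group has at least one registered worker.

```python
import queue
import threading


class GroupRegistry:
    """Hypothetical in-memory stand-in for the Zookeeper server-group folders."""

    def __init__(self):
        self._lock = threading.Lock()
        self._members = {}   # group id -> set of registered worker ids
        self._on_empty = {}  # group id -> callback fired when the group empties

    def register_group(self, group, on_empty):
        with self._lock:
            self._members[group] = set()
            self._on_empty[group] = on_empty

    def register_worker(self, group, worker):
        with self._lock:
            self._members[group].add(worker)

    def deregister_worker(self, group, worker):
        with self._lock:
            self._members[group].discard(worker)
            empty = not self._members[group]
        if empty:
            # Plays the role of the watch callback on the server group.
            self._on_empty[group]()


def run_process(num_groups=2, workers_per_group=3, steps=5):
    registry = GroupRegistry()
    events = queue.Queue()  # state reports sent to the main thread
    inboxes = {g: queue.Queue() for g in range(num_groups)}

    def server(group):
        msg = inboxes[group].get()  # block until the stop message arrives
        if msg == "stop":
            events.put(("server", group))  # report state, then exit

    def worker(group, wid):
        for _ in range(steps):
            pass  # placeholder for one training step
        registry.deregister_worker(group, wid)
        events.put(("worker", group, wid))  # report state, then exit

    threads = []
    # The main thread registers all local servers first ...
    for g in range(num_groups):
        registry.register_group(g, lambda g=g: inboxes[g].put("stop"))
        threads.append(threading.Thread(target=server, args=(g,)))
    # ... then registers each worker with its server group.
    for g in range(num_groups):
        for w in range(workers_per_group):
            registry.register_worker(g, w)
            threads.append(threading.Thread(target=worker, args=(g, w)))
    # Start threads only after all registrations, so that no group
    # appears empty before its workers have actually run.
    for t in threads:
        t.start()

    # The main thread exits once every worker and server has reported.
    expected = num_groups + num_groups * workers_per_group
    reports = [events.get() for _ in range(expected)]
    for t in threads:
        t.join()
    return reports


if __name__ == "__main__":
    for report in run_process():
        print(report)
```

Completing all registrations before any thread starts is a choice made for this sketch: it avoids a group looking empty, and its server stopping early, while some of its workers are still unregistered.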
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)