wangwei created SINGA-119:
-----------------------------
Summary: Remove job registration before launching the training program
Key: SINGA-119
URL: https://issues.apache.org/jira/browse/SINGA-119
Project: Singa
Issue Type: New Feature
Reporter: wangwei
Assignee: Sheng Wang
Job registration, including obtaining the job ID, is necessary for training in a
cluster. The `bin/singa-run.sh` script performs it before it connects to each
node over ssh to invoke the training program.
In some situations, e.g., a small model, users do not need to train the model on
multiple nodes. Many models can be trained on a single node (one process) with
multiple GPU cards. In this case, it would be better to remove the job
registration step to make job launching simpler. For instance, users can start
the training by
{code}
./singa -conf examples/cifar10/job.conf
{code}
or via a Python script (see SINGA-81)
{code}
python tool/python/examples/cifar10.py
{code}
The job ID is determined inside the program by cluster_rt.cc, which
communicates with the ZooKeeper server. We may later make ZooKeeper an optional
dependency for training on a single node, as it is mainly used for generating a
unique job ID.
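As a rough illustration of what a ZooKeeper-free job ID could look like on a
single node, here is a minimal Python sketch. It is not SINGA code: the function
name `local_job_id` and the timestamp-plus-PID scheme are assumptions made only
for this example, and the result is unique only among jobs started on the same
machine.

```python
import os
import time


def local_job_id(pid=None, now=None):
    """Hypothetical fallback: build a job ID without a coordination service.

    Combines the start time (in seconds) with the process ID. This is only
    unique on one machine, and only as long as a PID is not reused within
    the same second.
    """
    pid = os.getpid() if pid is None else pid
    now = time.time() if now is None else now
    # Upper bits: start time in seconds; lower 22 bits: process ID.
    return (int(now) << 22) | (pid & 0x3FFFFF)


if __name__ == "__main__":
    print(local_job_id())
```

For a cluster run, an ID generated this way would not be safe, since two nodes
could produce the same value; that case still needs a shared service.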
In the extreme case where there is a single worker, we do not need to create a
server thread. Instead, we can create an Updater instance inside the worker,
which updates the parameters locally. This would speed up training on a
single GPU card, because gradients and parameters would no longer be transferred
between the worker and the server. Currently, we have to transfer the gradients
from the worker (GPU memory) to the server (CPU memory), which is
time-consuming.
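The single-worker idea can be sketched as follows. This is a toy Python model,
not SINGA's C++ implementation: the class names `SGDUpdater` and `Worker`, the
plain SGD rule, and the toy objective are all assumptions chosen to show the
control flow, namely that the worker applies the update itself instead of
sending gradients to a server.

```python
class SGDUpdater:
    """Hypothetical updater: plain SGD, param <- param - lr * grad, in place."""

    def __init__(self, lr):
        self.lr = lr

    def update(self, params, grads):
        for i, g in enumerate(grads):
            params[i] -= self.lr * g


class Worker:
    """Hypothetical single worker that owns both parameters and updater."""

    def __init__(self, params, updater):
        self.params = params
        self.updater = updater

    def compute_gradients(self):
        # Toy objective 0.5 * sum(p^2), whose gradient is p itself.
        return list(self.params)

    def train_one_step(self):
        grads = self.compute_gradients()
        # No server thread: the update happens where the gradients live,
        # so nothing is copied between worker and server.
        self.updater.update(self.params, grads)


worker = Worker([1.0, -2.0], SGDUpdater(lr=0.5))
worker.train_one_step()
print(worker.params)  # [0.5, -1.0]
```

With the parameters and the updater in the same process, a real implementation
could keep both in GPU memory, which is where the saving described above would
come from.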