rahul003 commented on a change in pull request #9152: tutorial for distributed 
training
URL: https://github.com/apache/incubator-mxnet/pull/9152#discussion_r161617656
 
 

 ##########
 File path: docs/faq/distributed_training.md
 ##########
 @@ -0,0 +1,286 @@
+# Distributed training
+MXNet supports distributed training, enabling us to leverage multiple machines for faster training.
+In this document, we describe how it works, how to launch a distributed training job, and
+some environment variables which provide more control.
+
+## Types of parallelism
+There are two ways in which we can distribute the workload of training a neural network across multiple devices, which can be either GPUs or CPUs.
+The first way is *data parallelism*, which refers to the case where each 
device stores a complete copy of the model.
+Each device works with a different part of the dataset, and the devices 
collectively update a shared model.
+These devices can be located on a single machine or across multiple machines.
+In this document, we describe how to train a model with devices distributed 
across machines in a data parallel way.
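The data-parallel update described above can be sketched in plain Python (a toy model, not MXNet code: each "device" computes a gradient on its own shard, the gradients are averaged, and every device applies the same update to a shared weight):

```python
# Toy sketch of synchronous data-parallel SGD: minimize f(w) = mean_i (w - x_i)^2,
# whose gradient on one shard is 2 * mean(w - x) over that shard's samples.

def shard_gradient(w, shard):
    """Gradient of the mean squared error computed on one device's data shard."""
    return sum(2.0 * (w - x) for x in shard) / len(shard)

def data_parallel_step(w, shards, lr=0.1):
    """One synchronous step: average per-device gradients, apply one shared update."""
    grads = [shard_gradient(w, s) for s in shards]  # computed in parallel in practice
    avg_grad = sum(grads) / len(grads)
    return w - lr * avg_grad

w = 0.0
shards = [[1.0, 2.0], [3.0, 4.0]]  # two devices, each holding part of the dataset
for _ in range(200):
    w = data_parallel_step(w, shards)
print(round(w, 3))  # converges to the mean of all samples, 2.5
```

Because the gradients are averaged before the update, the result is the same as training on the full dataset with one device, which is what makes this scheme attractive.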
+
+When models are so large that they don't fit into device memory, a second approach called *model parallelism* is useful.
+Here, different devices are assigned the task of learning different parts of the model.
+Currently, MXNet supports model parallelism on a single machine only. Refer to [Training with multiple GPUs using model parallelism](https://mxnet.incubator.apache.org/versions/master/how_to/model_parallel_lstm.html) for more on this.
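The idea of model parallelism can be sketched as follows (plain Python with hypothetical names, not MXNet code): the model is split into stages assigned to different devices, and activations flow from one stage to the next, so no single device ever holds the full model.

```python
# Minimal sketch of model parallelism: two partitions of one model,
# each notionally placed on a different device.

class Stage:
    """One partition of the model, assigned to one device (names illustrative)."""
    def __init__(self, device, scale, bias):
        self.device, self.scale, self.bias = device, scale, bias

    def forward(self, x):
        return self.scale * x + self.bias

stage1 = Stage(device="gpu0", scale=2.0, bias=1.0)  # first part of the model
stage2 = Stage(device="gpu1", scale=0.5, bias=0.0)  # second part of the model

def model_forward(x):
    # In a real setup the activation is transferred between devices here.
    return stage2.forward(stage1.forward(x))

print(model_forward(3.0))  # (2*3 + 1) * 0.5 = 3.5
```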
+
+## How does distributed training work?
+The architecture of distributed training in MXNet is as follows:
+#### Types of processes
+MXNet has three types of processes which communicate with each other to 
accomplish training of a model.
+- Worker: A worker node performs training on a batch of training samples.
+Before processing each batch, the workers pull weights from the servers.
+The workers also send gradients to the servers after each batch.
+Depending on the workload for training a model, it might not be a good idea to run multiple worker processes on the same machine.
+- Server: There can be multiple servers, which store the model's parameters and communicate with workers.
+A server may or may not be co-located with the worker processes.
+- Scheduler: There is only one scheduler.
+The role of the scheduler is to set up the cluster.
+This includes waiting for messages from each node saying that it has come up and which port it is listening on.
+The scheduler then lets all processes know about every other node in the cluster, so that they can communicate with each other.
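Each process learns its role and how to reach the scheduler through environment variables that the launcher sets. A sketch of what such an environment might look like (variable names from the DMLC/ps-lite launcher; values and the training script name are illustrative):

```shell
# Illustrative only: environment seen by one process in the cluster.
export DMLC_ROLE=worker            # one of: scheduler, server, worker
export DMLC_PS_ROOT_URI=10.0.0.1   # IP address of the scheduler
export DMLC_PS_ROOT_PORT=9000      # port the scheduler listens on
export DMLC_NUM_SERVER=2           # number of server processes in the cluster
export DMLC_NUM_WORKER=2           # number of worker processes in the cluster
python train.py                    # hypothetical training script
```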
+
+#### KV Store
+MXNet provides a key-value store, which is a critical component used for 
multi-device and distributed training.
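The push/pull interaction between workers and the store can be illustrated with a toy key-value store in plain Python (this mimics the shape of the interface, not the real `mxnet.kvstore` API): workers push gradients under a key, the store aggregates them, and workers pull the combined value back.

```python
# Toy key-value store: init a key, push (sum) updates from workers, pull the result.

class ToyKVStore:
    def __init__(self):
        self._store = {}

    def init(self, key, value):
        """Set the initial value for a key (e.g. a parameter array)."""
        self._store[key] = value

    def push(self, key, values):
        """Aggregate updates from several workers by summing them into the key."""
        self._store[key] = self._store[key] + sum(values)

    def pull(self, key):
        """Return the current aggregated value for a key."""
        return self._store[key]

kv = ToyKVStore()
kv.init("weight", 1)
kv.push("weight", [2, 3])   # two workers push their gradients for the same key
print(kv.pull("weight"))    # 1 + 2 + 3 = 6
```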
 
 Review comment:
   @eric-haibin-lin @pracheer Ya, multi-device seems appropriate
   
   @aaronmarkham I've modified the paragraph as per your suggestion and added a 
line about what happens when there are multiple devices on a single machine. 
I've also moved up the section which describes the distribution of keys on 
servers

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
