This is an automated email from the ASF dual-hosted git repository.
thomasdelteil pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-mxnet.git
The following commit(s) were added to refs/heads/master by this push:
new 41d35c4 [TUTORIAL] Add multiple GPUs training tutorial (#15158)
41d35c4 is described below
commit 41d35c4566b0e10041f16c2ebb06133e68736775
Author: Sergey Sokolov <[email protected]>
AuthorDate: Fri Jun 14 09:55:41 2019 -0700
[TUTORIAL] Add multiple GPUs training tutorial (#15158)
* Add multiple GPUs training tutorial
* Add download source button
* Add tutorial to the test suite
* Remove from nightly build (no CI multigpu machines)
* Add extension to whitelisted multigpu tutorial
* Force build
* Force update
* Code review fixes
* Force build
* Typo fix and force build
* Add tutorial back to tests
* Add tutorial to the index
* Force build
---
docs/tutorials/gluon/multi_gpu.md | 193 ++++++++++++++++++++++++++++++++++++++
docs/tutorials/index.md | 6 +-
tests/tutorials/test_tutorials.py | 3 +
3 files changed, 199 insertions(+), 3 deletions(-)
diff --git a/docs/tutorials/gluon/multi_gpu.md b/docs/tutorials/gluon/multi_gpu.md
new file mode 100644
index 0000000..8e446dc
--- /dev/null
+++ b/docs/tutorials/gluon/multi_gpu.md
@@ -0,0 +1,193 @@
+<!--- Licensed to the Apache Software Foundation (ASF) under one -->
+<!--- or more contributor license agreements. See the NOTICE file -->
+<!--- distributed with this work for additional information -->
+<!--- regarding copyright ownership. The ASF licenses this file -->
+<!--- to you under the Apache License, Version 2.0 (the -->
+<!--- "License"); you may not use this file except in compliance -->
+<!--- with the License. You may obtain a copy of the License at -->
+
+<!--- http://www.apache.org/licenses/LICENSE-2.0 -->
+
+<!--- Unless required by applicable law or agreed to in writing, -->
+<!--- software distributed under the License is distributed on an -->
+<!--- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -->
+<!--- KIND, either express or implied. See the License for the -->
+<!--- specific language governing permissions and limitations -->
+<!--- under the License. -->
+
+# Multi-GPU training with the Gluon API
+
+In this tutorial we will walk through how to train a deep neural network on multiple GPUs within a single machine. This tutorial focuses on data parallelism as opposed to model parallelism. The data parallelism approach assumes that you can fit the whole model in the memory of a single GPU and only the training data needs to be partitioned. This is different from model parallelism, where the model is so big that it doesn't fit into a single GPU and has to be partitioned as well. Model parallelism is out of the scope of this tutorial.
+Here we will focus on implementing data parallel training for a convolutional neural network called LeNet.
+
+## Prerequisites
+
+- Two or more GPUs
+- CUDA 9 or higher
+- cuDNN v7 or higher
+- Knowledge of how to train a model using the Gluon API
+
+## Storing data on a GPU
+
+The basic primitive for specifying a tensor in Apache MXNet is [NDArray](https://mxnet.incubator.apache.org/api/python/ndarray/sparse.html#module-mxnet.ndarray). When you create an NDArray you have to provide the context - the device where this tensor is going to be stored. The context can be either CPU or GPU, and both can be indexed: if your machine has multiple GPUs, you can provide an index to specify which GPU to use. By default, the CPU context is used, which means that the tensor will be stored in the machine's main memory (RAM).
+
+```python
+import mxnet as mx
+
+# Use the first two GPUs when available; fall back to a single GPU
+# (referenced twice) or to the CPU so the example runs everywhere.
+n_gpu = mx.context.num_gpus()
+context = [mx.gpu(0), mx.gpu(1)] if n_gpu >= 2 else \
+          [mx.gpu(), mx.gpu()] if n_gpu == 1 else \
+          [mx.cpu(), mx.cpu()]
+
+a = mx.nd.array([1, 2, 3], ctx=context[0])
+b = mx.nd.array([5, 6, 7], ctx=context[1])
+```
+
+The next step would be to do operations on these 2 NDArrays. But, unfortunately, if we try to do any operation involving both of these arrays, Apache MXNet will return an error: `Check failed: e == cudaSuccess CUDA: an illegal memory access was encountered`. This error is returned because we tried to use arrays that are stored on different contexts: Apache MXNet wants users to explicitly control memory allocation and doesn't automatically copy data between GPUs. If we want to do an operation on these arrays, we first have to move them to the same context.
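+
+As a sketch of what goes wrong (only reproducible on a machine with two physical GPUs - with the single-GPU or CPU fallback above, both arrays share a context and the addition simply succeeds); note that `wait_to_read()` is needed because MXNet executes operations asynchronously:
+
+```python
+try:
+    c = a + b          # operands live on different devices
+    c.wait_to_read()   # force execution to surface the asynchronous error
+except mx.MXNetError as err:
+    print('Operation across contexts failed:', err)
+```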
+
+We can manually copy data between GPUs using the [as_in_context method](https://mxnet.incubator.apache.org/api/python/ndarray/ndarray.html?#mxnet.ndarray.NDArray.as_in_context). We can get the current context of an NDArray via the [context property](https://mxnet.incubator.apache.org/api/python/ndarray/ndarray.html?#mxnet.ndarray.NDArray.context).
+
+```python
+# Copy `b` to the context of `a`, then add the two arrays on that device
+c = a + b.as_in_context(a.context)
+```
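+
+As a quick sanity check (the exact device shown depends on your hardware):
+
+```python
+print(c)          # [ 6.  8. 10.] computed on a single device
+print(c.context)  # the context of `a`, e.g. gpu(0)
+```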
+
+From this example we have learned that we can perform operations on NDArrays only if they are stored on the same context. So, how can we split the data between GPUs, but use the same model for training? We will answer this question in the next section.
+
+## Storing the network on multiple GPUs
+
+When you create a network using [Blocks](https://mxnet.incubator.apache.org/api/python/gluon/gluon.html#mxnet.gluon.Block), the parameters of the blocks are also stored in NDArrays. When you initialize your network, you have to specify which context you are going to use for the underlying NDArrays. The feature of the [initialize method](https://mxnet.incubator.apache.org/api/python/gluon/gluon.html#mxnet.gluon.Block.initialize) is that it can accept a list of contexts, meaning that you can pass more than one GPU and a copy of the parameters will be created on each of them.
+
+```python
+from mxnet import init
+from mxnet.gluon import nn
+
+# A LeNet-style convolutional network: two conv/pool stages followed
+# by three dense layers, with 10 outputs for the 10 MNIST classes
+net = nn.Sequential()
+net.add(nn.Conv2D(channels=6, kernel_size=5, activation='relu'),
+        nn.MaxPool2D(pool_size=2, strides=2),
+        nn.Conv2D(channels=16, kernel_size=3, activation='relu'),
+        nn.MaxPool2D(pool_size=2, strides=2),
+        nn.Flatten(),
+        nn.Dense(120, activation="relu"),
+        nn.Dense(84, activation="relu"),
+        nn.Dense(10))
+
+net.initialize(init=init.Xavier(), ctx=context)
+```
+
+The actual initialization will happen once we do the first forward pass on the network, but at this stage Apache MXNet knows that we are expecting the parameters of the network to be on both GPUs.
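+
+We can verify this once a first forward pass has triggered the deferred initialization. A minimal sketch, assuming the `net` and `context` defined above; `Parameter.list_data()` returns the copy of a parameter held on each context:
+
+```python
+x = mx.nd.random.uniform(shape=(1, 1, 28, 28), ctx=context[0])
+net(x)  # the first forward pass triggers the actual parameter allocation
+
+# Each parameter now has one copy per context, e.g. [gpu(0), gpu(1)]
+print([d.context for d in net[0].weight.list_data()])
+```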
+
+## Multi-GPU training schema
+
+So far, we have learned how to define NDArrays in different contexts and that a network can be initialized on two GPUs at the same time.
+
+To do multi-GPU training with a given batch of data, we divide the examples in the batch into a number of portions equal to the number of GPUs we use and distribute one portion to each GPU. Then, each GPU will individually calculate the local gradient of the model parameters based on the batch subset it was assigned and the model parameters it maintains. Next, we sum together the local gradients from all GPUs to get the current batch stochastic gradient. After that, each GPU uses this batch stochastic gradient to update the complete set of model parameters that it maintains. The sketch below makes the aggregation step concrete.
+
+
+
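+A minimal sketch of that aggregation step on plain Gluon parameters, assuming the `net` and `context` defined above and that a backward pass has already filled the local gradients on each device; `Parameter.list_grad()` returns the gradient buffer held on each context. The `Trainer` we use later performs this synchronization for us:
+
+```python
+for param in net.collect_params().values():
+    if param.grad_req == 'null':
+        continue
+    # Sum the per-device gradients on the first device...
+    grad_sum = sum(g.as_in_context(context[0]) for g in param.list_grad())
+    # ...and broadcast the aggregated gradient back to every device.
+    for g in param.list_grad():
+        grad_sum.as_in_context(g.context).copyto(g)
+```
+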
+This approach allows us to avoid the limitation of doing operations on different GPUs - we move subsets of data to each GPU and the operations happen inside each individual GPU only. After that we aggregate the resulting gradients and each GPU receives a copy of the gradients to do the model parameters update.
+
+With that approach, knowing how to move data between contexts and how to initialize a model on multiple contexts, we already know everything that is needed to do multi-GPU training. But Apache MXNet also provides us a convenient method to distribute data between multiple GPUs, which we are going to cover in the section below.
+
+## Splitting data between GPUs
+
+Apache MXNet provides a utility method [gluon.utils.split_and_load](https://mxnet.incubator.apache.org/api/python/gluon/gluon.html#mxnet.gluon.utils.split_and_load) to split the data between multiple contexts. The result of the method's call is a list of NDArrays, each of which is stored on a separate context provided in the `ctx_list` argument. The code below demonstrates how to use the method:
+
+```python
+data = mx.random.uniform(shape=(100, 10))
+# With two contexts this returns two NDArrays of shape (50, 10)
+result = mx.gluon.utils.split_and_load(data, ctx_list=context)
+```
+
+If we explore the result, we will notice that the `split_and_load` method divided the data into two chunks of the same shape, `(50, 10)`. If the data cannot be divided evenly between the contexts, we have to specify `even_split=False` to instruct the method to do an uneven split.
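+
+For example, a hypothetical array of 5 rows cannot be split evenly between 2 contexts:
+
+```python
+odd_data = mx.random.uniform(shape=(5, 10))
+chunks = mx.gluon.utils.split_and_load(odd_data, ctx_list=context, even_split=False)
+print([c.shape for c in chunks])  # e.g. [(2, 10), (3, 10)]
+```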
+
+At this point we are ready to assemble a complete multi-GPU training example.
+
+## Multi-GPU classification of MNIST images
+
+In the first step, we are going to load the MNIST images and use [ToTensor](https://mxnet.apache.org/api/python/gluon/data.html#mxnet.gluon.data.vision.transforms.ToTensor) to convert the format of the data from `height x width x channel` to `channel x height x width` and divide each pixel value by 255.
+
+```python
+train_data = mx.gluon.data.vision.MNIST(train=True).transform_first(mx.gluon.data.vision.transforms.ToTensor())
+val_data = mx.gluon.data.vision.MNIST(train=False).transform_first(mx.gluon.data.vision.transforms.ToTensor())
+```
+
+The next step is to create a [DataLoader](https://mxnet.incubator.apache.org/api/python/gluon/data.html#mxnet.gluon.data.DataLoader), which constructs batches from the dataset. We create one for the training dataset and one for the validation dataset.
+
+```python
+batch_size = 128
+train_loader = mx.gluon.data.DataLoader(train_data, shuffle=True, batch_size=batch_size)
+val_loader = mx.gluon.data.DataLoader(val_data, shuffle=False, batch_size=batch_size)
+```
+
+After that we create a [Trainer](https://mxnet.incubator.apache.org/api/python/gluon/gluon.html#trainer), which holds the optimization algorithm to be used and its hyperparameters, as well as the [Loss](https://mxnet.incubator.apache.org/api/python/gluon/loss.html#mxnet.gluon.loss.SoftmaxCrossEntropyLoss) function and a [metric](https://mxnet.incubator.apache.org/api/python/metric/metric.html#mxnet.metric.Accuracy) to track:
+
+```python
+trainer = mx.gluon.Trainer(
+ params=net.collect_params(),
+ optimizer='sgd',
+ optimizer_params={'learning_rate': 0.04},
+)
+
+metric = mx.metric.Accuracy()
+loss_function = mx.gluon.loss.SoftmaxCrossEntropyLoss()
+```
+
+After these preparations we are ready to define the training loop. In the training loop we split the data between the GPUs, pass each slice through the network on its own GPU, do the backward step on each loss to accumulate the gradients, and call [trainer.step](https://mxnet.incubator.apache.org/api/python/gluon/gluon.html#mxnet.gluon.Trainer.step) to actually update the parameters of the model:
+
+```python
+num_epochs = 10
+
+for epoch in range(num_epochs):
+    for inputs, labels in train_loader:
+        actual_batch_size = inputs.shape[0]
+        # Split data among GPUs. Since split_and_load is a deterministic function,
+        # inputs and labels are going to be split in the same way between GPUs.
+        inputs = mx.gluon.utils.split_and_load(inputs, ctx_list=context, even_split=False)
+        labels = mx.gluon.utils.split_and_load(labels, ctx_list=context, even_split=False)
+
+        # The forward pass and the loss computation need to be wrapped
+        # in a `record()` scope to make sure the computational graph is
+        # recorded in order to automatically compute the gradients
+        # during the backward pass.
+        with mx.autograd.record():
+            outputs = [net(input_slice) for input_slice in inputs]
+            losses = [loss_function(o, l) for o, l in zip(outputs, labels)]
+
+        # Iterate over losses to compute gradients for each input slice
+        for loss in losses:
+            loss.backward()
+
+        # Update the metric with the labels and predictions of each slice
+        for l, o in zip(labels, outputs):
+            metric.update(l, o)
+
+        # Update the parameters by stepping the trainer; the batch size
+        # is required to normalize the gradients by `1 / batch_size`.
+        trainer.step(batch_size=actual_batch_size)
+
+    # Print the evaluation metric and reset it for the next epoch
+    name, acc = metric.get()
+    print('After epoch {}: {} = {}'.format(epoch + 1, name, acc))
+    metric.reset()
+```
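+
+The `val_loader` we created earlier can be used in the same way to measure accuracy on the held-out images; a minimal sketch of such an evaluation pass:
+
+```python
+metric.reset()
+for inputs, labels in val_loader:
+    inputs = mx.gluon.utils.split_and_load(inputs, ctx_list=context, even_split=False)
+    labels = mx.gluon.utils.split_and_load(labels, ctx_list=context, even_split=False)
+    # Only the forward pass is needed for evaluation, so no record() scope
+    outputs = [net(input_slice) for input_slice in inputs]
+    for l, o in zip(labels, outputs):
+        metric.update(l, o)
+
+name, acc = metric.get()
+print('Validation: {} = {}'.format(name, acc))
+metric.reset()
+```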
+
+If you run this example and monitor it with NVIDIA's `nvidia-smi` tool, you will notice that both GPUs are used to perform calculations.
+
+## Advanced topic
+
+As we mentioned above, the gradients for each data split are calculated independently and then later summed together. We haven't mentioned yet where exactly this aggregation happens.
+
+Apache MXNet uses [KVStore](https://mxnet.incubator.apache.org/versions/master/api/scala/kvstore.html) - a virtual place for data sharing between different devices, including machines and GPUs. The KVStore is responsible for storing and, by default, aggregating the gradients of the model. The physical location of the KVStore is defined when we create a [Trainer](https://mxnet.incubator.apache.org/versions/master/api/python/gluon/gluon.html#mxnet.gluon.Trainer) and by default is set to `device`, which means that the aggregation happens on the GPUs themselves. This default has two consequences you should be aware of.
+
+The first thing is that there is an additional memory allocation that happens on the GPUs that is not directly related to your data and your model: it stores auxiliary information for GPU sync-up. Depending on the complexity of your model, the amount of required memory can be significant, and you may even experience CUDA out of memory exceptions. If that is the case, and you cannot decrease the batch size anymore, you may want to consider switching the `KVStore` storage to RAM by setting the `kvstore` argument of the `Trainer` to `local`.
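+
+A sketch of such a Trainer, reusing the network and hyperparameters from above; `local` keeps the gradient aggregation in CPU memory instead of on the GPUs:
+
+```python
+trainer = mx.gluon.Trainer(
+    params=net.collect_params(),
+    optimizer='sgd',
+    optimizer_params={'learning_rate': 0.04},
+    kvstore='local',  # aggregate gradients in RAM instead of on the GPUs
+)
+```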
+
+The second thing is that since the auxiliary information is distributed among GPUs in a round-robin fashion on a per-block level, `KVStore` may use more memory on some GPUs and less on others. For example, if your model has a very big embedding layer, you may see that your first GPU uses 90% of your memory while others use only 50%. That affects how much data you actually can load in a single batch, because the data between devices is split evenly. If that is the case, you may have to decrease the batch size so that the batch still fits on the most loaded GPU, or again consider switching the `KVStore` storage to RAM.
+
+## Conclusion
+
+With Apache MXNet, training on multiple GPUs doesn't need a lot of extra code. To do multi-GPU training you need to initialize a model on all GPUs, divide each batch of data into separate splits where each is stored on a different GPU, and run the model separately on every split. The synchronization of gradients and parameters between GPUs is done automatically by Apache MXNet.
+
+## Recommended Next Steps
+
+* Check out our two video tutorials on improving your code performance. In the [first video](https://www.youtube.com/watch?v=n8tN6pRZBdE) we explain how to visualize the performance, and in the [second video](https://www.youtube.com/watch?v=Cqo7FPftNyo) we show how to optimize it.
+
+<!-- INSERT SOURCE DOWNLOAD BUTTONS -->
\ No newline at end of file
diff --git a/docs/tutorials/index.md b/docs/tutorials/index.md
index 2527ccf..c6b151c 100644
--- a/docs/tutorials/index.md
+++ b/docs/tutorials/index.md
@@ -91,7 +91,7 @@ Select API:
 * [Image similarity search with InfoGAN](/tutorials/gluon/info_gan.html)
* Practitioner Guides
* [Gotchas using NumPy](/tutorials/gluon/gotchas_numpy_in_mxnet.html)
- * [Multi-GPU training](http://gluon.mxnet.io/chapter07_distributed-learning/multiple-gpus-gluon.html) <img src="https://upload.wikimedia.org/wikipedia/commons/6/6a/External_link_font_awesome.svg" alt="External link" height="15px" style="margin: 0px 0px 3px 3px;"/>
+ * [Multi-GPU training](/tutorials/gluon/multi_gpu.html)<span style="color:red"> (new!)</span> ([Alternative](http://gluon.mxnet.io/chapter07_distributed-learning/multiple-gpus-gluon.html) <img src="https://upload.wikimedia.org/wikipedia/commons/6/6a/External_link_font_awesome.svg" alt="External link" height="15px" style="margin: 0px 0px 3px 3px;"/>)
 * [Checkpointing and Model Serialization (a.k.a. saving and loading)](/tutorials/gluon/save_load_params.html) <img src="https://upload.wikimedia.org/wikipedia/commons/6/6a/External_link_font_awesome.svg" alt="External link" height="15px" style="margin: 0px 0px 3px 3px;"/> ([Alternative](http://gluon.mxnet.io/chapter03_deep-neural-networks/serialization.html))
 * [Distributed Training](https://github.com/apache/incubator-mxnet/tree/master/example/distributed_training)
 * [Inference using an ONNX model](/tutorials/onnx/inference_on_onnx_model.html)
@@ -101,7 +101,7 @@ Select API:
* [Learning Rate Schedules](/tutorials/gluon/learning_rate_schedules.html)
 * [Advanced Learning Rate Schedules](/tutorials/gluon/learning_rate_schedules_advanced.html)
* [Profiling MXNet Models](/tutorials/python/profiler.html)
- * [Module to Gluon API](/tutorials/python/module_to_gluon.html)<span style="color:red"> (new!)</span>
+ * [Module to Gluon API](/tutorials/python/module_to_gluon.html)
 * [Gluon end to end from training to inference](/tutorials/gluon/gluon_from_experiment_to_deployment.html)
* [Automatic Mixed Precision in Gluon](/tutorials/amp/amp_tutorial.html)
@@ -127,7 +127,7 @@ Select API:
 * [HybridBlocks](/tutorials/gluon/hybrid.html) ([Alternative](http://gluon.mxnet.io/chapter07_distributed-learning/hybridize.html) <img src="https://upload.wikimedia.org/wikipedia/commons/6/6a/External_link_font_awesome.svg" alt="External link" height="15px" style="margin: 0px 0px 3px 3px;"/>)
* [Block Naming](/tutorials/gluon/naming.html)
* [Custom Operators](/tutorials/gluon/customop.html)
- * [Control Flow operators](/tutorials/control_flow/ControlFlowTutorial.html)<span style="color:red"> (new!)</span>
+ * [Control Flow operators](/tutorials/control_flow/ControlFlowTutorial.html)
* Autograd
* [AutoGrad API](/tutorials/gluon/autograd.html)
 * [AutoGrad API with chain rule](http://gluon.mxnet.io/chapter01_crashcourse/autograd.html) <img src="https://upload.wikimedia.org/wikipedia/commons/6/6a/External_link_font_awesome.svg" alt="External link" height="15px" style="margin: 0px 0px 3px 3px;"/>
diff --git a/tests/tutorials/test_tutorials.py b/tests/tutorials/test_tutorials.py
index bbb45c7..2237906 100644
--- a/tests/tutorials/test_tutorials.py
+++ b/tests/tutorials/test_tutorials.py
@@ -103,6 +103,9 @@ def test_gluon_autograd():
 def test_gluon_gluon():
     assert _test_tutorial_nb('gluon/gluon')
 
+def test_gluon_multi_gpu():
+    assert _test_tutorial_nb('gluon/multi_gpu')
+
 def test_gluon_save_load_params():
     assert _test_tutorial_nb('gluon/save_load_params')