[GitHub] ThomasDelteil closed pull request #12298: Remove model parallelism references from FAQ

GitBox Mon, 24 Sep 2018 11:58:02 -0700

ThomasDelteil closed pull request #12298: Remove model parallelism references from FAQ
URL: https://github.com/apache/incubator-mxnet/pull/12298


This is a PR merged from a forked repository.
Because GitHub hides the original diff of a foreign (forked) pull request
on merge, it is reproduced below for the sake of provenance:

diff --git a/docs/faq/distributed_training.md b/docs/faq/distributed_training.md
index d4fa72db23a..abb90ba4fad 100644
--- a/docs/faq/distributed_training.md
+++ b/docs/faq/distributed_training.md
@@ -12,7 +12,6 @@ In this document, we describe how to train a model with devices distributed acro
 
 When models are so large that they don't fit into device memory, then a second way called *model parallelism* is useful.
 Here, different devices are assigned the task of learning different parts of the model.
-Currently, MXNet supports Model parallelism in a single machine only. Refer [Training with multiple GPUs using model parallelism](https://mxnet.incubator.apache.org/versions/master/faq/model_parallel_lstm.html) for more on this.
 
 ## How Does Distributed Training Work?
 The following concepts are key to understanding distributed training in MXNet:
diff --git a/docs/faq/index.md b/docs/faq/index.md
index 07dd9b9d7ca..84a052d8a9f 100644
--- a/docs/faq/index.md
+++ b/docs/faq/index.md
@@ -22,8 +22,6 @@ and full working examples, visit the [tutorials section](../tutorials/index.md).
 
 * [How can I train using multiple machines with data parallelism?](http://mxnet.io/faq/distributed_training.html)
 
-* [How can I train using multiple GPUs with model parallelism?](http://mxnet.io/faq/model_parallel_lstm.html)
-
 
 ## Speed
 * [How do I use gradient compression with distributed training?](http://mxnet.io/faq/gradient_compression.html)
diff --git a/docs/faq/model_parallel_lstm.md b/docs/faq/model_parallel_lstm.md
deleted file mode 100644
index b78b2c574dc..00000000000
--- a/docs/faq/model_parallel_lstm.md
+++ /dev/null
@@ -1,75 +0,0 @@
-# Training with Multiple GPUs Using Model Parallelism
-Training deep learning models can be resource intensive.
-Even with a powerful GPU, some models can take days or weeks to train.
-Large long short-term memory (LSTM) recurrent neural networks
-can be especially slow to train,
-with each layer, at each time step, requiring eight matrix multiplications.
-Fortunately, given cloud services like AWS,
-machine learning practitioners often  have access
-to multiple machines and multiple GPUs.
-One key strength of _MXNet_ is its ability to leverage
-powerful heterogeneous hardware environments to achieve significant speedups.
-
-There are two primary ways that we can spread a workload across multiple devices.
-In a previous document, [we addressed data parallelism](./multi_devices.md),
-an approach in which samples within a batch are divided among the available devices.
-With data parallelism, each device stores a complete copy of the model.
-Here, we explore _model parallelism_, a different approach.
-Instead of splitting the batch among the devices, we partition the model itself.
-Most commonly, we achieve model parallelism by assigning the parameters (and computation)
-of different layers of the network to different devices.
-
-In particular, we will focus on LSTM recurrent networks.
-LSTMS are powerful sequence models, that have proven especially useful
-for [natural language translation](https://arxiv.org/pdf/1409.0473.pdf), [speech recognition](https://arxiv.org/abs/1512.02595),
-and working with [time series data](https://arxiv.org/abs/1511.03677).
-For a general high-level introduction to LSTMs,
-see the excellent [tutorial](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) by Christopher Olah.
-
-
-## Model Parallelism: Using Multiple GPUs As a Pipeline
-Model parallelism in deep learning was first proposed
-for the _extraordinarily large_ convolutional layer in GoogleNet.
-From this implementation, we take the idea of placing each layer on a separate GPU.
-Using model parallelism in such a layer-wise fashion
-provides the benefit that no GPU has to maintain all of the model parameters in memory.
-
-<img width="517" alt="screen shot 2016-05-06 at 10 13 16 pm" src="https://cloud.githubusercontent.com/assets/5545640/15089697/d6f4fca0-13d7-11e6-9331-7f94fcc7b4c6.png">
-
-In the preceding figure, each LSTM layer is assigned to a different GPU.
-After GPU 1 finishes computing layer 1 for the first sentence, it passes its output to GPU 2.
-At the same time, GPU 1 fetches the next sentence and starts training.
-This differs significantly from data parallelism.
-Here, there is no contention to update the shared model at the end of each iteration,
-and most of the communication happens when passing intermediate results between GPUs.
-
-
-## Workload Partitioning
-
-Implementing model parallelism requires knowledge of the training task.
-Here are some general heuristics that we find useful:
-
-- To minimize communication time, place neighboring layers on the same GPUs.
-- Be careful to balance the workload between GPUs.
-- Remember that different kinds of layers have different computation-memory properties.
-
-<img width="449" alt="screen shot 2016-05-07 at 1 51 02 am" src="https://cloud.githubusercontent.com/assets/5545640/15090455/37a30ab0-13f6-11e6-863b-efe2b10ec2e6.png">
-
-Let's take a quick look at the two pipelines in the preceding diagram.
-They both have eight layers with a decoder and an encoder layer.
-Based on our first principle, it's unwise to place all neighboring layers on separate GPUs.
-We also want to balance the workload across GPUs.
-Although the LSTM layers consume less memory than the decoder/encoder layers, they consume more computation time because of the dependency of the unrolled LSTM.
-Thus, the partition on the left will be faster than the one on the right
-because the workload is more evenly distributed.
-
-
-## Apply Bucketing to Model Parallelism
-
-To achieve model parallelism while using bucketing,
-you need to unroll an LSTM model for each bucket
-to obtain an executor for each.
-
-On the other hand, because model parallelism partitions the model/layers,
-the input data has to be transformed/transposed to the agreed shape.
-For more details, see [bucket_io](https://github.com/apache/incubator-mxnet/blob/master/example/rnn/old/bucket_io.py).
diff --git a/docs/faq/multi_devices.md b/docs/faq/multi_devices.md
index a43879cb523..75b9f8fec97 100644
--- a/docs/faq/multi_devices.md
+++ b/docs/faq/multi_devices.md
@@ -15,8 +15,6 @@ updated model are communicated across these devices.
 MXNet also supports model parallelism.
 In this approach, each device holds onto only part of the model.
 This proves useful when the model is too large to fit onto a single device.
-As an example, see the following [tutorial](./model_parallel_lstm.md)
-which shows how to use model parallelism for training a multi-layer LSTM model.
 In this tutorial, we'll focus on data parallelism.
 
 ## Multiple GPUs within a Single Machine
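
For readers of this archive: the "Workload Partitioning" heuristics in the
removed model_parallel_lstm.md (keep neighboring layers on the same GPU,
balance the load between GPUs) can be read as a contiguous-partition problem.
The sketch below is not taken from the PR or from MXNet; it is a minimal,
framework-free illustration. The function name `partition_layers` and the
per-layer cost numbers are hypothetical, chosen only to mirror the
eight-layer pipeline described in the removed text.

```python
from itertools import accumulate
from typing import List, Sequence


def partition_layers(costs: Sequence[float], num_devices: int) -> List[List[int]]:
    """Split layer indices into contiguous groups, one per device,
    minimizing the total cost of the most heavily loaded device."""
    n = len(costs)
    if not 1 <= num_devices <= n:
        raise ValueError("need 1 <= num_devices <= number of layers")

    # prefix[i] is the summed cost of layers 0..i-1
    prefix = [0.0, *accumulate(costs)]
    inf = float("inf")

    # best[d][i]: smallest achievable peak load when the first i layers
    # are spread over d devices; cut[d][i]: where the last group starts.
    best = [[inf] * (n + 1) for _ in range(num_devices + 1)]
    cut = [[0] * (n + 1) for _ in range(num_devices + 1)]
    best[0][0] = 0.0

    for d in range(1, num_devices + 1):
        for i in range(d, n + 1):
            for j in range(d - 1, i):
                peak = max(best[d - 1][j], prefix[i] - prefix[j])
                if peak < best[d][i]:
                    best[d][i], cut[d][i] = peak, j

    # Walk the recorded cut points backwards to recover the groups.
    groups, end = [], n
    for d in range(num_devices, 0, -1):
        start = cut[d][end]
        groups.append(list(range(start, end)))
        end = start
    return groups[::-1]


if __name__ == "__main__":
    # Hypothetical relative compute costs: a cheap encoder layer,
    # six heavier LSTM layers, and a cheap decoder layer.
    layer_costs = [1, 3, 3, 3, 3, 3, 3, 1]
    for gpu, layers in enumerate(partition_layers(layer_costs, 4)):
        print(f"gpu {gpu}: layers {layers}")
```

Because every group is a contiguous run of layers, neighboring layers stay
together and activations cross a device boundary only once per cut. A real
placement would also need to account for the memory footprint of each layer,
which this sketch ignores.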


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
