ThomasDelteil commented on a change in pull request #13094: WIP: Simplifications and some fun stuff for the MNIST Gluon tutorial
URL: https://github.com/apache/incubator-mxnet/pull/13094#discussion_r237290087
##########
File path: docs/tutorials/gluon/mnist.md
##########
@@ -1,333 +1,434 @@
-# Handwritten Digit Recognition
+# Hand-written Digit Recognition
-In this tutorial, we'll give you a step by step walk-through of how to build a hand-written digit classifier using the [MNIST](https://en.wikipedia.org/wiki/MNIST_database) dataset.
+In this tutorial, we'll give you a step-by-step walkthrough of building a hand-written digit classifier using the [MNIST](https://en.wikipedia.org/wiki/MNIST_database) dataset.
-MNIST is a widely used dataset for the hand-written digit classification task. It consists of 70,000 labeled 28x28 pixel grayscale images of hand-written digits. The dataset is split into 60,000 training images and 10,000 test images. There are 10 classes (one for each of the 10 digits). The task at hand is to train a model using the 60,000 training images and subsequently test its classification accuracy on the 10,000 test images.
+MNIST is a widely used dataset for the hand-written digit classification task. It consists of 70,000 labeled grayscale images of hand-written digits, each 28x28 pixels in size. The dataset is split into 60,000 training images and 10,000 test images. There are 10 classes (one for each of the 10 digits). The task at hand is to train a model that can correctly classify the images into the digits they represent. The 60,000 training images are used to fit the model, and its performance in terms of classification accuracy is subsequently validated on the 10,000 test images.
 **Figure 1:** Sample images from the MNIST dataset.
-This tutorial uses MXNet's new high-level interface, gluon package to implement MLP using
-imperative fashion.
-
-This is based on the Mnist tutorial with symbolic approach. You can find it [here](http://mxnet.io/tutorials/python/mnist.html).
+This tutorial uses MXNet's high-level *Gluon* interface to implement neural networks in an imperative fashion. It is based on [the corresponding tutorial written with the symbolic approach](https://mxnet.incubator.apache.org/tutorials/python/mnist.html).
 ## Prerequisites
-To complete this tutorial, we need:
-- MXNet. See the instructions for your operating system in [Setup and Installation](http://mxnet.io/install/index.html).
+To complete this tutorial, you need:
-- [Python Requests](http://docs.python-requests.org/en/master/) and [Jupyter Notebook](http://jupyter.org/index.html).
+- MXNet. See the instructions for your operating system in [Setup and Installation](https://mxnet.incubator.apache.org/install/index.html).
+- The Python [`requests`](http://docs.python-requests.org/en/master/) library.
+- (Optional) The [Jupyter Notebook](https://jupyter.org/index.html) software for interactively running the provided `.ipynb` file.
 ```
 $ pip install requests jupyter
 ```
 ## Loading Data
-Before we define the model, let's first fetch the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset.
+The following code downloads the MNIST dataset to the default location (`.mxnet/datasets/mnist/` in your home directory) and creates `Dataset` objects `train_data` and `val_data` for training and validation, respectively.
+These objects can later be used to get one image or a batch of images at a time, together with their corresponding labels.
-The following source code downloads and loads the images and the corresponding labels into memory.
+We also immediately apply the `transform_first()` method and supply a function that moves the channel axis of the images to the beginning (`(28, 28, 1) -> (1, 28, 28)`), casts them to `float32` and rescales them from `[0, 255]` to `[0, 1]`.
+The name `transform_first` reflects the fact that these datasets contain images and labels, and that the transform should only be applied to the first of each `(image, label)` pair.
 ```python
 import mxnet as mx
-# Fixing the random seed
+# Select a fixed random seed for reproducibility
 mx.random.seed(42)
-mnist = mx.test_utils.get_mnist()
+def data_xform(data):
+    """Move channel axis to the beginning, cast to float32, and normalize to [0, 1]."""
+    return nd.moveaxis(data, 2, 0).astype('float32') / 255
+
+train_data = mx.gluon.data.vision.MNIST(train=True).transform_first(data_xform)
+val_data = mx.gluon.data.vision.MNIST(train=False).transform_first(data_xform)
 ```
-After running the above source code, the entire MNIST dataset should be fully loaded into memory. Note that for large datasets it is not feasible to pre-load the entire dataset first like we did here. What is needed is a mechanism by which we can quickly and efficiently stream data directly from the source. MXNet Data iterators come to the rescue here by providing exactly that. Data iterator is the mechanism by which we feed input data into an MXNet training algorithm and they are very simple to initialize and use and are optimized for speed. During training, we typically process training samples in small batches and over the entire training lifetime will end up processing each training example multiple times. In this tutorial, we'll configure the data iterator to feed examples in batches of 100. Keep in mind that each example is a 28x28 grayscale image and the corresponding label.
+Since the MNIST dataset is relatively small, the `MNIST` class loads it into memory all at once, but for larger datasets like ImageNet, this would no longer be possible.
+The Gluon `Dataset` class from which `MNIST` derives supports both cases.
+In general, `Dataset` and `DataLoader` (which we will encounter next) are the machinery in MXNet that provides a stream of input data to be consumed by a training algorithm, typically in batches of multiple data entities at once for better efficiency.
+In this tutorial, we will configure the data loader to feed examples in batches of 100.
+
+An image batch is commonly represented as a 4-D array with shape `(batch_size, num_channels, height, width)`.
+This convention is denoted by "BCHW", and it is the default in MXNet.
+For the MNIST dataset, each image has a size of 28x28 pixels and one color channel (grayscale), hence the shape of an input batch will be `(batch_size, 1, 28, 28)`.
-Image batches are commonly represented by a 4-D array with shape `(batch_size, num_channels, width, height)`. For the MNIST dataset, since the images are grayscale, there is only one color channel. Also, the images are 28x28 pixels, and so each image has width and height equal to 28. Therefore, the shape of input is `(batch_size, 1, 28, 28)`. Another important consideration is the order of input samples. When feeding training examples, it is critical that we don't feed samples with the same label in succession. Doing so can slow down training.
-Data iterators take care of this by randomly shuffling the inputs. Note that we only need to shuffle the training data. The order does not matter for test data.
+Another important consideration is the order of input samples.
+When feeding training examples, it is critical not to feed samples with the same label in succession since doing so can slow down training.
+Data iterators take care of this issue by randomly shuffling the inputs.
+Note that we only need to shuffle the training data -- for validation data, the order does not matter.
-The following source code initializes the data iterators for the MNIST dataset. Note that we initialize two iterators: one for train data and one for test data.
+The following code initializes the data iterators for the MNIST dataset.
 ```python
 batch_size = 100
-train_data = mx.io.NDArrayIter(mnist['train_data'], mnist['train_label'], batch_size, shuffle=True)
-val_data = mx.io.NDArrayIter(mnist['test_data'], mnist['test_label'], batch_size)
+train_loader = mx.gluon.data.DataLoader(train_data, shuffle=True, batch_size=batch_size)
+val_loader = mx.gluon.data.DataLoader(val_data, shuffle=False, batch_size=batch_size)
 ```
 ## Approaches
-We will cover a couple of approaches for performing the hand written digit recognition task. The first approach makes use of a traditional deep neural network architecture called Multilayer Perceptron (MLP). We'll discuss its drawbacks and use that as a motivation to introduce a second more advanced approach called Convolution Neural Network (CNN) that has proven to work very well for image classification tasks.
+We will cover two approaches for performing the hand-written digit recognition task.
+In our first attempt, we will make use of a traditional neural network architecture called [Multilayer Perceptron (MLP)](https://en.wikipedia.org/wiki/Multilayer_perceptron).
+Although this architecture lets us achieve about 95.5 % accuracy on the validation set, we will recognize and discuss some of its drawbacks and use them as a motivation for using a different network.
+In the subsequent second attempt, we introduce the more advanced and very widely used [Convolutional Neural Network (CNN)](https://en.wikipedia.org/wiki/Convolutional_neural_network) architecture that has proven to work very well for image classification tasks.
-Now, let's import required nn modules
+As a first step, we run some convenience imports of frequently used modules.
 ```python
-from __future__ import print_function
+from __future__ import print_function  # only relevant for Python 2
 import mxnet as mx
-from mxnet import gluon
+from mxnet import nd, gluon, autograd
 from mxnet.gluon import nn
-from mxnet import autograd as ag
 ```
-### Define a network: Multilayer Perceptron
+### Defining a network: Multilayer Perceptron
-The first approach makes use of a [Multilayer Perceptron](https://en.wikipedia.org/wiki/Multilayer_perceptron) to solve this problem. We'll define the MLP using MXNet's imperative approach.
+MLPs consist of several fully connected layers.
+In a fully connected (short: FC) layer, each neuron is connected to every neuron in its preceding layer.
+From a linear algebra perspective, an FC layer applies an [affine transform](https://en.wikipedia.org/wiki/Affine_transformation) *Y = X W + b* to an input matrix *X* of size (*n x m*) and outputs a matrix *Y* of size (*n x k*).
+The number *k*, also referred to as *hidden size*, corresponds to the number of neurons in the FC layer.
+An FC layer has two learnable parameters: the (*m x k*) weight matrix *W* and the (*1 x k*) bias vector *b*.
-MLPs consist of several fully connected layers. A fully connected layer or FC layer for short, is one where each neuron in the layer is connected to every neuron in its preceding layer. From a linear algebra perspective, an FC layer applies an [affine transform](https://en.wikipedia.org/wiki/Affine_transformation) to the *n x m* input matrix *X* and outputs a matrix *Y* of size *n x k*, where *k* is the number of neurons in the FC layer. *k* is also referred to as the hidden size. The output *Y* is computed according to the equation *Y = W X + b*. The FC layer has two learnable parameters, the *m x k* weight matrix *W* and the *m x 1* bias vector *b*.
+In an MLP, the outputs of FC layers are typically fed into an activation function that applies an elementwise nonlinearity.
+This step is crucial since it gives neural networks the ability to classify inputs that are not linearly separable.
+Common choices for activation functions are [sigmoid](https://en.wikipedia.org/wiki/Sigmoid_function), [hyperbolic tangent ("tanh")](https://en.wikipedia.org/wiki/Hyperbolic_function#Definitions), and [rectified linear unit (ReLU)](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)).
+In this example, we'll use the ReLU activation function since it has several nice properties that make it a good default choice.
-In an MLP, the outputs of most FC layers are fed into an activation function, which applies an element-wise non-linearity. This step is critical and it gives neural networks the ability to classify inputs that are not linearly separable. Common choices for activation functions are sigmoid, tanh, and [rectified linear unit](https://en.wikipedia.org/wiki/Rectifier_%28neural_networks%29) (ReLU). In this example, we'll use the ReLU activation function which has several desirable properties and is typically considered a default choice.
+The following code snippet declares three fully connected (or *dense*) layers with 128, 64 and 10 neurons each, where the last number of neurons matches the number of output classes in our dataset.
+Note that the last layer uses no activation function since the [softmax](https://mxnet.incubator.apache.org/api/python/ndarray/ndarray.html#mxnet.ndarray.softmax) activation will be implicitly applied by the loss function later on.
+To build the neural network, we use a [`Sequential` layer](https://mxnet.incubator.apache.org/api/python/gluon/gluon.html#mxnet.gluon.nn.Sequential), which is a convenience class to build a linear stack of layers, often called a *feed-forward neural net*.
-The following code declares three fully connected layers with 128, 64 and 10 neurons each.
-The last fully connected layer often has its hidden size equal to the number of output classes in the dataset. Furthermore, these FC layers uses ReLU activation for performing an element-wise ReLU transformation on the FC layer output.
-
-To do this, we will use [Sequential layer](http://mxnet.io/api/python/gluon/gluon.html#mxnet.gluon.nn.Sequential) type. This is simply a linear stack of neural network layers. `nn.Dense` layers are nothing but the fully connected layers we discussed above.
+**Note**: using the `name_scope()` context manager is optional.
+It is, however, good practice since it uses a common prefix for the names of all layers generated in that scope, which can be very helpful during debugging.
 ```python
-# define network
-net = nn.Sequential()
+net = nn.Sequential('MLP')
 with net.name_scope():
-    net.add(nn.Dense(128, activation='relu'))
-    net.add(nn.Dense(64, activation='relu'))
-    net.add(nn.Dense(10))
+    net.add(
+        nn.Flatten(),
+        nn.Dense(128, activation='relu'),
+        nn.Dense(64, activation='relu'),
+        nn.Dense(10)
+    )
 ```
-#### Initialize parameters and optimizer
-
-The following source code initializes all parameters received from parameter dict using [Xavier](http://mxnet.io/api/python/optimization/optimization.html#mxnet.initializer.Xavier) initializer
-to train the MLP network we defined above.
+#### Initializing parameters and optimizer
-For our training, we will make use of the stochastic gradient descent (SGD) optimizer. In particular, we'll be using mini-batch SGD. Standard SGD processes train data one example at a time. In practice, this is very slow and one can speed up the process by processing examples in small batches. In this case, our batch size will be 100, which is a reasonable choice. Another parameter we select here is the learning rate, which controls the step size the optimizer takes in search of a solution. We'll pick a learning rate of 0.02, again a reasonable choice. Settings such as batch size and learning rate are what are usually referred to as hyper-parameters. What values we give them can have a great impact on training performance.
+Before the network can be used, its parameters (weights and biases) need to be set to initial values that are sufficiently random while keeping the magnitude of gradients limited.
+The [Xavier](https://mxnet.incubator.apache.org/api/python/optimization/optimization.html#mxnet.initializer.Xavier) initializer is usually a good default choice.
-We will use [Trainer](http://mxnet.io/api/python/gluon/gluon.html#trainer) class to apply the
-[SGD optimizer](http://mxnet.io/api/python/optimization/optimization.html#mxnet.optimizer.SGD) on the
-initialized parameters.
+Since the `net.initialize()` method creates arrays for its parameters, it needs to know where to store the values: in CPU or GPU memory.
+Like many other functions and classes that deal with memory management in one way or another, the `initialize()` method takes an optional `ctx` (short for *context*) argument, where the return value of either `mx.cpu()` or `mx.gpu()` can be provided.
 ```python
-gpus = mx.test_utils.list_gpus()
-ctx = [mx.gpu()] if gpus else [mx.cpu(0), mx.cpu(1)]
-net.initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx)
-trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.02})
+ctx = mx.gpu(0) if mx.context.num_gpus() > 0 else mx.cpu(0)
+net.initialize(mx.init.Xavier(), ctx=ctx)
 ```
-#### Train the network
+To train the network parameters, we will make use of the [stochastic gradient descent (SGD)](https://en.wikipedia.org/wiki/Stochastic_gradient_descent) optimizer.
+More specifically, we use mini-batch SGD in contrast to the classical SGD that processes one example at a time, which is very slow in practice.
+(Recall that we set the batch size to 100 in the ["Loading Data"](#loading-data) part.)
-Typically, one runs the training until convergence, which means that we have learned a good set of model parameters (weights + biases) from the train data. For the purpose of this tutorial, we'll run training for 10 epochs and stop. An epoch is one full pass over the entire train data.
+Besides the batch size, the SGD algorithm has one important *hyperparameter*: the *learning rate*.
+It determines the size of steps that the algorithm takes in search of parameters that allow the network to optimally fit the training data.
+Therefore, this value has great influence on both the course of the training process and its final outcome.
+In general, hyperparameters refer to *non-learnable* values that need to be chosen before training and that have a potential effect on the outcome.
+In this example, further hyperparameters are the number of layers in the network, the number of neurons of the first two layers, the activation function and (later) the loss function.
-We will take following steps for training:
+The SGD optimization method can be accessed in MXNet Gluon through the [`Trainer`](https://mxnet.incubator.apache.org/api/python/gluon/gluon.html#trainer) class.
+Internally, it makes use of the [`SGD`](https://mxnet.incubator.apache.org/api/python/optimization/optimization.html#mxnet.optimizer.SGD) optimizer class.
-- Define [Accuracy evaluation metric](http://mxnet.io/api/python/metric/metric.html#mxnet.metric.Accuracy) over training data.
-- Loop over inputs for every epoch.
-- Forward input through network to get output.
-- Compute loss with output and label inside record scope.
-- Backprop gradient inside record scope.
-- Update evaluation metric and parameters with gradient descent.
+```python
+trainer = gluon.Trainer(
+    params=net.collect_params(),
+    optimizer='sgd',
+    optimizer_params={'learning_rate': 0.02},
+)
+```
+
+#### Training
-Loss function takes (output, label) pairs and computes a scalar loss for each sample in the mini-batch. The scalars measure how far each output is from the label.
-There are many predefined loss functions in gluon.loss. Here we use
-[softmax_cross_entropy_loss](http://mxnet.io/api/python/gluon/gluon.html#mxnet.gluon.loss.softmax_cross_entropy_loss) for digit classification. We will compute loss and do backward propagation inside
-training scope which is defined by `autograd.record()`.
+Training the network requires a way to tell how well the network currently fits the training data.
+Following common practice in optimization, this quality of fit is expressed through a *loss value* (also referred to as badness-of-fit or data discrepancy), which the algorithm then tries to minimize by adjusting the weights of the model.
+
+Ideally, in a classification task, we would like to use the prediction inaccuracy, i.e., the fraction of incorrectly classified samples, to guide the training to a lower value.
+Unfortunately, inaccuracy is a poor choice for training since it contains almost no information that can be used to update the network parameters (its gradient is zero almost everywhere).
+As a better behaved proxy for inaccuracy, the [softmax cross-entropy loss](https://mxnet.incubator.apache.org/api/python/gluon/loss.html#mxnet.gluon.loss.SoftmaxCrossEntropyLoss) is a popular choice.
+It has the essential property of being minimal for the correct prediction, but at the same time, it is everywhere differentiable with nonzero gradient.
+The [accuracy](https://mxnet.incubator.apache.org/api/python/metric/metric.html#mxnet.metric.Accuracy) metric is still useful for monitoring the training progress, since it is more intuitively interpretable than a loss value.
 ```python
-%%time
-epoch = 10
-# Use Accuracy as the evaluation metric.
 metric = mx.metric.Accuracy()
-softmax_cross_entropy_loss = gluon.loss.SoftmaxCrossEntropyLoss()
-for i in range(epoch):
-    # Reset the train data iterator.
-    train_data.reset()
-    # Loop over the train data iterator.
-    for batch in train_data:
-        # Splits train data into multiple slices along batch_axis
-        # and copy each slice into a context.
-        data = gluon.utils.split_and_load(batch.data[0], ctx_list=ctx, batch_axis=0)
-        # Splits train labels into multiple slices along batch_axis
-        # and copy each slice into a context.
-        label = gluon.utils.split_and_load(batch.label[0], ctx_list=ctx, batch_axis=0)
-        outputs = []
-        # Inside training scope
-        with ag.record():
-            for x, y in zip(data, label):
-                z = net(x)
-                # Computes softmax cross entropy loss.
-                loss = softmax_cross_entropy_loss(z, y)
-                # Backpropagate the error for one iteration.
-                loss.backward()
-                outputs.append(z)
-        # Updates internal evaluation
-        metric.update(label, outputs)
-        # Make one step of parameter update. Trainer needs to know the
-        # batch size of data to normalize the gradient by 1/batch_size.
-        trainer.step(batch.data[0].shape[0])
-    # Gets the evaluation result.
+loss_function = gluon.loss.SoftmaxCrossEntropyLoss()
+```
+
+Typically, the training is run until convergence, which means that further iterations will no longer lead to improvements of the loss function, and that the network has probably learned a good set of model parameters from the train data.
+For the purpose of this tutorial, we only loop 10 times over the entire dataset.
+One such pass over the data is usually called an *epoch*.
+
+The following steps are taken in each `epoch`:
+
+- Get a minibatch of `inputs` and `labels` from the `train_loader`.
+- Feed the `inputs` to the network, producing `outputs`.
+- Compute the minibatch `loss` value by comparing `outputs` to `labels`.
+- Backpropagate the gradients to update the network parameters by calling `loss.backward()`.
+- Print the current accuracy over the training data, i.e., the fraction of correctly classified training examples.
+
+```python
+num_epochs = 10
+
+for epoch in range(num_epochs):
+    for inputs, labels in train_loader:
+        # Possibly copy inputs and labels to the GPU
+        inputs = inputs.as_in_context(ctx)
+        labels = labels.as_in_context(ctx)
+
+        # The forward pass and the loss computation need to be wrapped
+        # in a `record()` scope to indicate that the results will

Review comment:
   The forward pass and the loss computation need to be wrapped in a `record()` scope to make sure the computational graph is recorded in order to automatically compute the gradients during the backward pass.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services
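Editor's note: for readers who want to see the pattern the review comment refers to in runnable form, below is a minimal, self-contained sketch of a single Gluon training step in which the forward pass and the loss computation are wrapped in an `autograd.record()` scope before calling `backward()`. The tiny stand-in network, the random batch, and the hyperparameter values are illustrative assumptions only; this is not the continuation of the PR's actual training loop.

```python
# Minimal illustrative sketch (not part of the PR diff above): one Gluon
# training step with the forward pass and loss computation wrapped in an
# autograd.record() scope, so gradients can be computed in the backward pass.
import mxnet as mx
from mxnet import nd, gluon, autograd
from mxnet.gluon import nn

ctx = mx.cpu(0)

# Tiny stand-in network, initialized as in the tutorial
net = nn.Sequential()
net.add(nn.Dense(10))
net.initialize(mx.init.Xavier(), ctx=ctx)

trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.02})
loss_function = gluon.loss.SoftmaxCrossEntropyLoss()

# A random batch standing in for one (inputs, labels) pair from train_loader
inputs = nd.random.uniform(shape=(100, 1, 28, 28), ctx=ctx)
labels = nd.random.randint(0, 10, shape=(100,), ctx=ctx).astype('float32')

with autograd.record():
    outputs = net(inputs)                   # forward pass is recorded
    loss = loss_function(outputs, labels)   # loss computation is recorded

loss.backward()                             # backprop through the recorded graph
trainer.step(batch_size=inputs.shape[0])    # update parameters, normalized by batch size
```

Per the step list in the diff, the actual tutorial loop would additionally update and print the accuracy metric after each parameter update.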
