Author: wangwei
Date: Sun Jul 19 15:18:56 2015
New Revision: 1691832
URL: http://svn.apache.org/r1691832
Log:
CMS commit to singa by wangwei
Modified:
incubator/singa/site/trunk/content/markdown/docs/programming-model.md
Modified: incubator/singa/site/trunk/content/markdown/docs/programming-model.md
URL: http://svn.apache.org/viewvc/incubator/singa/site/trunk/content/markdown/docs/programming-model.md?rev=1691832&r1=1691831&r2=1691832&view=diff
==============================================================================
--- incubator/singa/site/trunk/content/markdown/docs/programming-model.md (original)
+++ incubator/singa/site/trunk/content/markdown/docs/programming-model.md Sun Jul 19 15:18:56 2015
@@ -1,125 +1,279 @@
-## Programming Model
+## Model Configuration
-We describe the programming model of SINGA in this article.
-Base data structures are introduced firstly, and then we show examples for
-users with different levels of deep learning background.
-
-### Base Data Structures
-
-#### Layer
-
-Layer is the first class citizen in SINGA. Users construct their deep learning
-models by creating layer objects and combining them. SINGA
-takes care of running BackPropagation (or Contrastive Divergence) algorithms
-to calculate the gradients for parameters and calling [Updaters](#updater) to
-update them.
-
- class Layer{
- /**
- * Setup layer properties.
- * Setup the shapes for data and parameters, also setup some properties
- * based on the layer configuration and connected src layers.
-     * @param conf user defined layer configuration of type [LayerProto](#netproto)
- * @param srclayers layers connecting to this layer
- */
- Setup(conf, srclayers);
- /**
- * Setup the layer properties.
- * This function is called if the model is partitioned due to distributed
-     * training. Shape of the layer is already set by the partition algorithm,
- * and is passed in to set other properties.
-     * @param conf user defined layer configuration of type [LayerProto](#netproto)
-     * @param shape shape set by partition algorithm (for distributed training).
- * @param srclayers layers connecting to this layer
- */
- SetupAfterPartition(conf, shape, srclayers);
- /**
- * Compute features of this layer based on connected layers.
- * BP and CD will call this to calculate gradients
- * @param training boolean phase indicator for training or test
- * @param srclayers layers connecting to this layer
- */
- ComputeFeature(training, srclayers);
- /**
- * Compute gradients for parameters and connected layers.
- * BP and CD will call this to calculate gradients
- * @param srclayers layers connecting to this layer.
- */
- ComputeGradient(srclayers)=0;
- }
-
-The above pseudo code shows the base Layer class. Users override these
-methods to implement their own layer classes. For example, we have implemented
-popular layers like ConvolutionLayer, InnerProductLayer. We also provide a
-DataLayer which is a base layer for loading (and prefetching) data from disk or HDFS. A base ParserLayer
-is created for parsing the raw data and convert it into records that are recognizable by SINGA.
-
-#### NetProto
-
-Since deep learning models consist of multiple layers. The model structure includes
-the properties of each layer and the connections between layers. SINGA uses
-google protocol buffer for users to configure the model structure. The protocol
-buffer message for the model structure is defined as:
-
- NetProto{
- repeated LayerProto layer;
- }
-
- LayerProto{
- string name; // user defined layer name for displaying
- string type; // One layer class has a unique type.
- repeated string srclayer_name; // connected layer names;
- repeated ParamProto param; // parameter configurations
+SINGA uses the stochastic gradient descent (SGD) algorithm to train the parameters of deep learning models.
+For each SGD iteration, a [Worker] computes the gradients of the parameters of the NeuralNet, and an
+[Updater] updates the parameter values based on those gradients. Hence the model configuration mainly
+consists of these three parts. We introduce the NeuralNet, Worker and Updater in the following paragraphs
+and describe their configurations.
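+
+The overall shape of a model.conf file is roughly as follows. This is only a sketch: the top-level
+field names holding the updater and neural net configurations (written here as `updater` and
+`neuralnet`) are assumptions, and the authoritative schema is ModelProto in [model.proto].
+
+    name: "example-model"
+    alg: kBP
+    updater {
+      ...
+    }
+    neuralnet {
+      layer {
+        ...
+      }
+    }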
+
+
+## NeuralNet
+
+### Deep learning training
+
+Deep learning is labeled as a feature learning technique, which usually consists of multiple layers.
+Each layer is associated with a feature transformation function. After going through all layers,
+the raw input features (e.g., pixels of images) are converted into high-level features that are
+easier to use for tasks like classification.
+
+Training a deep learning model is to find the optimal parameters of the transformation functions
+that generate good features for specific tasks. The goodness of a set of parameters is measured by
+a loss function, e.g., [Cross-Entropy Loss]. Since loss functions are usually non-linear and non-convex,
+it is difficult to get a closed-form solution. Normally, people use the SGD algorithm, which randomly
+initializes the parameters and then iteratively updates them to reduce the loss.
+
+### Uniform model representation
+
+Many deep learning models have been proposed. Figure 1 categorizes popular deep learning models
+based on their layer connections. The NeuralNet abstraction of SINGA consists of multiple layers
+connected by directed edges. This abstraction is able to represent models from all three categories:
+
+ * For feed-forward models, the connections are already directed.
+
+ * For RNN models, we unroll them into directed connections, as shown in Figure 2.
+
+ * For the undirected connections in RBM, DBM, etc., we replace each undirected connection
+ with two directed connections, as shown in Figure 3.
+
+Specifically, the NeuralNet class is defined in [neuralnet.h]:
+
+ ...
+ vector<Layer*> layers_;
+ ...
+
+The Layer class is defined in [base_layer.h]:
+
+ vector<Layer*> srclayers_, dstlayers_;
+    LayerProto layer_proto_;  // layer configuration, including meta info, e.g., name
+ ...
+
+
+The connections to other layers are kept in `srclayers_` and `dstlayers_`.
+Since there are many different feature transformations, there are correspondingly many different
+[Layer implementations]. Layers whose feature transformation functions involve parameters
+hold Param instances in the layer class, e.g.,
+
+ Param weight;
+
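+For instance, a fully-connected (inner-product) layer would typically hold a weight matrix and a
+bias vector as two Param members. The sketch below is illustrative only; member names and the exact
+class layout of the built-in layers may differ (see [Layer implementations]):
+
+    // illustrative sketch only: a layer holding trainable parameters
+    class InnerProductLayer : public Layer {
+     public:
+      // feature transformation methods omitted
+     private:
+      Param weight_;  // weight matrix of the inner-product transformation
+      Param bias_;    // bias vector added after the matrix multiplication
+    };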
+
+### Configure the structure of a NeuralNet instance
+
+To train a deep learning model, the first step is to write the configuration for the
+model structure, i.e., the layers and connections of the NeuralNet. Like Caffe, we use
+the [Google Protocol Buffer] to define the configuration schema. The NetProto message specifies
+the configuration fields for a NeuralNet instance:
+
+    message NetProto {
+      repeated LayerProto layer = 1;
+      ...
+    }
+
+The configuration is then a sequence of layer configurations:
+
+ layer {
+ // layer configuration
+ }
+ layer {
+ // layer configuration
+ }
+ ...
+
+To configure the model structure, we just configure each layer involved in the model.
+
+ message LayerProto {
+ // the layer name used for identification
+ required string name = 1;
+ // source layer names
+ repeated string srclayers = 3;
+ // parameters, e.g., weight matrix or bias vector
+ repeated ParamProto param = 12;
+ // the layer type from the enum above
+ required LayerType type = 20;
+ // configuration for convolution layer
+ optional ConvolutionProto convolution_conf = 30;
+ // configuration for concatenation layer
+ optional ConcateProto concate_conf = 31;
+ // configuration for dropout layer
+ optional DropoutProto dropout_conf = 33;
...
}
-Users can create a plain text file and fill it with the configurations. SINGA
-parses it according to user provided path.
+A sample configuration for a feed-forward model looks like:
-#### Param
+ layer {
+ name : "data"
+ type : kDataShard
+ }
+ layer {
+ name : "image"
+ type : kImageParser
+ srclayers : "data"
+ }
+ layer {
+ name : "conv"
+ type : kConvolution
+ srclayers : "image"
+ param {
+ // configuration for parameter
+ }
+      convolution_conf {
+ // configuration for convolution operations
+ }
+ ...
+ }
+
+The layer type list is defined in [model.proto]. One type (kFoo) corresponds to one child
+class of Layer (FooLayer) and one configuration field (foo_conf). SINGA infers the `dstlayers_`
+of each layer after reading the configurations of all layers. Developers can implement new layers
+and add them to the type list, so that users can configure them; a sketch of such a layer is shown
+below. [layer] describes the configurations of the current built-in layers.
+
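+The following is only a sketch of what a new layer might look like, using the hypothetical
+kFoo/FooLayer/foo_conf names from the convention above. The two virtual functions are the ones
+described in the Worker section below, and the comment bodies stand for the actual transformation code:
+
+    // hypothetical FooLayer following the kFoo / FooLayer / foo_conf convention
+    class FooLayer : public Layer {
+     public:
+      void ComputeFeature(Phase phase, Metric* perf) override {
+        // read settings from layer_proto_.foo_conf() and transform the features
+        // of srclayers_ into this layer's feature
+      }
+      void ComputeGradient(Phase phase) override {
+        // compute gradients for this layer's Param objects and for srclayers_
+      }
+    };
+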
+Figure 4 shows the model structure corresponding to the neural network configuration
+in [cifar10/model.conf].
+
+
+## Worker
-The Param class is shown below. Users do not need to extend the Param class for
-most cases. We make it a base class just for future extension. For example,
-if a new initialization trick is proposed in the future, we can override the `Init`
-method to implement it.
-
- Param{
- /**
- * Set properties of the parameter.
- * @param conf user defined parameter configuration of type ParamProto
- * @param shape shape of the parameter
- Setup(conf, shape);
- /**
- * Initialize the data of the parameter.
- /
- Init();
-      ...// methods to handle synchronizations with parameter servers and other workers
+At the beginning, the Worker initializes the values of the Param instances of each layer, either randomly
+(according to the user-configured distribution) or by loading them from a [checkpoint file].
+In each training iteration, the Worker visits the layers of the neural network to compute the gradients of
+the Param instances of each layer. Corresponding to the three categories of models, there are three
+different algorithms to compute the gradients of a neural network:
+
+ 1. Back-propagation (BP) for feed-forward models
+ 2. Back-propagation through time (BPTT) for recurrent neural networks
+ 3. Contrastive divergence (CD) for models like RBM and DBM.
+
+SINGA provides these three algorithms as three Worker implementations. Users only need to specify
+in the model.conf file which algorithm should be used. The configuration protocol is
+
+ message ModelProto {
+ ...
+ enum GradCalcAlg {
+ // BP algorithm for feed-forward models, e.g., CNN, MLP, RNN
+ kBP = 1;
+ // BPTT for recurrent neural networks
+ kBPTT = 2;
+ // CD algorithm for RBM, DBM etc., models
+ kCd = 3;
+ }
+ // gradient calculation algorithm
+      required GradCalcAlg alg = 8 [default = kBP];
+ ...
}
-#### Updater
+These algorithms override the TrainOneBatch function of the Worker; e.g., BPWorker
+implements it as
+
+ void BPWorker::TrainOneBatch(int step, Metric* perf) {
+ Forward(step, kTrain, train_net_, perf);
+ Backward(step, train_net_);
+ }
-There are many SGD extensions for updating parameters,
+The Forward function passes the raw input features of one mini-batch through all layers, and the Backward
+function visits the layers in reverse order to compute the gradients of the loss w.r.t. each layer's feature
+and each layer's Param objects. Different algorithms visit the layers in different orders, and some
+traverse the neural network multiple times, e.g., CDWorker's TrainOneBatch function is:
+
+    void CDWorker::TrainOneBatch(int step, Metric* perf) {
+      PositivePhase(step, kTrain, train_net_, perf);
+      NegativePhase(step, kTrain, train_net_, perf);
+      GradientPhase(step, train_net_);
+    }
+
+But all algorithms finally call the same two functions of the Layer class:
+
+ /**
+ * Transform features from connected layers into features of this layer.
+ *
+ * @param phase kTrain, kTest, kPositive, etc.
+ */
+ virtual void ComputeFeature(Phase phase, Metric* perf) = 0;
+ /**
+ * Compute gradients for parameters (and connected layers).
+ *
+ * @param phase kTrain, kTest, kPositive, etc.
+ */
+ virtual void ComputeGradient(Phase phase) = 0;
+
+All Layer implementations must implement the above two functions.
+
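+
+As a concrete illustration, a BP-style traversal could be sketched as below. This is only a sketch:
+the real Forward and Backward functions take additional arguments (e.g., the training step) and handle
+partitioned networks and parameter synchronization, and the `layers()` accessor is assumed here to
+return the `layers_` vector in topological order.
+
+    // sketch of how a BP-style Worker visits layers
+    void Forward(Phase phase, NeuralNet* net, Metric* perf) {
+      for (Layer* layer : net->layers())      // layers in topological order
+        layer->ComputeFeature(phase, perf);   // compute this layer's feature
+    }
+
+    void Backward(Phase phase, NeuralNet* net) {
+      const vector<Layer*>& layers = net->layers();
+      for (auto it = layers.rbegin(); it != layers.rend(); ++it)
+        (*it)->ComputeGradient(phase);        // gradients w.r.t. features and Params
+    }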
+
+## Updater
+
+Once the gradients of parameters are computed, the Updater will update parameter values.
+There are many SGD variants for updating parameters,
like [AdaDelta](http://arxiv.org/pdf/1212.5701v1.pdf),
[AdaGrad](http://www.magicbroom.info/Papers/DuchiHaSi10.pdf),
[RMSProp](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf),
[Nesterov](http://scholar.google.com/citations?view_op=view_citation&hl=en&user=DJ8Ep8YAAAAJ&citation_for_view=DJ8Ep8YAAAAJ:hkOj_22Ku90C)
-and SGD with momentum. We provide a base Updater to deal with these algorithms.
-New parameter updating algorithms can be added by extending the base Updater.
-
- Updater{
- /**
- * @param proto user configuration for the updater.
- Init(conf);
- /**
- * Update parameter based on its gradient
- * @param step training step
- * @param param the Param object
- */
- Update(step, param);
+and SGD with momentum. The core function of the Updater is
+
+ /**
+ * Update parameter values based on gradients
+ * @param step training step
+ * @param param pointer to the Param object
+ * @param grad_scale scaling factor for the gradients
+ */
+ void Update(int step, Param* param, float grad_scale=1.0f);
+ /**
+ * @param step training step
+ * @return the learning rate for this step
+ */
+ float GetLearningRate(int step);
+
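+A plain SGD update (without momentum or weight decay) could then be sketched as below. The Param
+accessors used here (`size()`, `mutable_cpu_data()`, `cpu_grad()`) are illustrative assumptions,
+not the exact SINGA API:
+
+    // sketch of a vanilla SGD update using the Updater interface above
+    void SGDUpdater::Update(int step, Param* param, float grad_scale) {
+      float lr = GetLearningRate(step);         // learning rate for this step
+      float* data = param->mutable_cpu_data();  // parameter values
+      const float* grad = param->cpu_grad();    // gradients computed by the Worker
+      for (int i = 0; i < param->size(); ++i)
+        data[i] -= lr * grad_scale * grad[i];   // w <- w - lr * scale * g
+    }
+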
+SINGA provides several built-in updaters and learning rate change methods. Users can configure them
+according to the [UpdaterProto]:
+
+    message UpdaterProto {
+      enum UpdaterType {
+        // normal SGD with momentum and weight decay
+        kSGD = 1;
+        // adaptive subgradient, http://www.magicbroom.info/Papers/DuchiHaSi10.pdf
+        kAdaGrad = 2;
+        // http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
+        kRMSProp = 3;
+        // Nesterov first optimal gradient method
+        kNesterov = 4;
+      }
+      // updater type
+      required UpdaterType type = 1 [default = kSGD];
+      // configuration for RMSProp algorithm
+      optional RMSPropProto rmsprop_conf = 50;
+
+      enum ChangeMethod {
+        kFixed = 0;
+        kInverseT = 1;
+        kInverse = 2;
+        kExponential = 3;
+        kLinear = 4;
+        kStep = 5;
+        kFixedStep = 6;
+      }
+      // change method for learning rate
+      required ChangeMethod lr_change = 2 [default = kFixed];
+
+      optional FixedStepProto fixedstep_conf = 40;
+      ...
+      optional float momentum = 31 [default = 0];
+      optional float weight_decay = 32 [default = 0];
+      // base learning rate
+      optional float base_lr = 34 [default = 0];
}
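+
+For example, an updater section of model.conf using plain SGD might look as follows. The field
+values are illustrative only, and the name of the enclosing field (`updater`) is an assumption;
+check ModelProto in [model.proto] for the actual field:
+
+    updater {
+      type: kSGD
+      base_lr: 0.01
+      momentum: 0.9
+      weight_decay: 0.0005
+      lr_change: kFixed
+    }
+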
-### Examples
-The [MLP example](..)
-shows how to configure the model through google protocol buffer.
+## Other model configuration fields
+
+Some other important configuration fields for training a deep learning model are listed below:
+
+    // model name, e.g., "cifar10-dcnn", "mnist-mlp"
+    required string name = 1;
+    // frequency of displaying training info
+    required int32 display_frequency = 3;
+    // total num of steps for training
+    required int32 train_steps = 5;
+    ... // step, frequency for validation and test
+    // frequency of checkpoint
+    optional int32 checkpoint_frequency = 34 [default = 0];
+    // whether to resume training from a checkpoint
+    optional bool resume = 36 [default = false];
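+
+For example, the top level of a model.conf file might contain (values are illustrative only):
+
+    name: "mnist-mlp"
+    train_steps: 1000
+    display_frequency: 50
+    checkpoint_frequency: 100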
+
+The pages on [checkpoint and restore] and [validation and test] have more details on the related fields.