Author: wangwei
Date: Sun Jul 19 15:18:56 2015
New Revision: 1691832
URL: http://svn.apache.org/r1691832
Log:
CMS commit to singa by wangwei
Modified:
incubator/singa/site/trunk/content/markdown/docs/programming-model.md
Modified: incubator/singa/site/trunk/content/markdown/docs/programming-model.md
URL: http://svn.apache.org/viewvc/incubator/singa/site/trunk/content/markdown/docs/programming-model.md?rev=1691832&r1=1691831&r2=1691832&view=diff
==============================================================================
--- incubator/singa/site/trunk/content/markdown/docs/programming-model.md (original)
+++ incubator/singa/site/trunk/content/markdown/docs/programming-model.md Sun Jul 19 15:18:56 2015
@@ -1,125 +1,279 @@
-## Programming Model
+## Model Configuration
-We describe the programming model of SINGA in this article.
-Base data structures are introduced firstly, and then we show examples for
-users with different levels of deep learning background.
-
-### Base Data Structures
-
-#### Layer
-
-Layer is the first class citizen in SINGA. Users construct their deep learning
-models by creating layer objects and combining them. SINGA
-takes care of running BackPropagation (or Contrastive Divergence) algorithms
-to calculate the gradients for parameters and calling [Updaters](#updater) to
-update them.
-
- class Layer{
- /**
- * Setup layer properties.
- * Setup the shapes for data and parameters, also setup some properties
- * based on the layer configuration and connected src layers.
-     * @param conf user defined layer configuration of type [LayerProto](#netproto)
- * @param srclayers layers connecting to this layer
- */
- Setup(conf, srclayers);
- /**
- * Setup the layer properties.
- * This function is called if the model is partitioned due to distributed
-     * training. Shape of the layer is already set by the partition algorithm,
- * and is passed in to set other properties.
-     * @param conf user defined layer configuration of type [LayerProto](#netproto)
-     * @param shape shape set by partition algorithm (for distributed training).
- * @param srclayers layers connecting to this layer
- */
- SetupAfterPartition(conf, shape, srclayers);
- /**
- * Compute features of this layer based on connected layers.
- * BP and CD will call this to calculate gradients
- * @param training boolean phase indicator for training or test
- * @param srclayers layers connecting to this layer
- */
- ComputeFeature(training, srclayers);
- /**
- * Compute gradients for parameters and connected layers.
- * BP and CD will call this to calculate gradients
- * @param srclayers layers connecting to this layer.
- */
- ComputeGradient(srclayers)=0;
- }
-
-The above pseudo code shows the base Layer class. Users override these
-methods to implement their own layer classes. For example, we have implemented
-popular layers like ConvolutionLayer, InnerProductLayer. We also provide a
-DataLayer which is a base layer for loading (and prefetching) data from disk or HDFS. A base ParserLayer
-is created for parsing the raw data and convert it into records that are recognizable by SINGA.
-
-#### NetProto
-
-Since deep learning models consist of multiple layers. The model structure includes
-the properties of each layer and the connections between layers. SINGA uses
-google protocol buffer for users to configure the model structure. The protocol
-buffer message for the model structure is defined as:
-
- NetProto{
- repeated LayerProto layer;
- }
-
- LayerProto{
- string name; // user defined layer name for displaying
- string type; // One layer class has a unique type.
- repeated string srclayer_name; // connected layer names;
- repeated ParamProto param; // parameter configurations
+SINGA uses the stochastic gradient descent (SGD) algorithm to train the parameters of deep learning models.
+For each SGD iteration, a [Worker] computes the gradients of the parameters of the NeuralNet, and an
+[Updater] updates the parameter values based on those gradients. Hence the model configuration mainly
+consists of these three parts. We introduce the NeuralNet, Worker and Updater in the following paragraphs
+and describe their configurations.
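+
+The overall shape of a model.conf file is roughly as follows. This is only a sketch: the top-level
+field names holding the updater and neural net configurations (written here as `updater` and
+`neuralnet`) are assumptions, and the authoritative schema is ModelProto in [model.proto].
+
+    name: "example-model"
+    alg: kBP
+    updater {
+      ...
+    }
+    neuralnet {
+      layer {
+        ...
+      }
+    }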
+
+
+## NeuralNet
+
+### Deep learning training
+
+Deep learning is labeled as a feature learning technique, which usually consists of multiple layers.
+Each layer is associated with a feature transformation function. After going through all layers,
+the raw input features (e.g., pixels of images) are converted into high-level features that are
+easier to use for tasks like classification.
+
+Training a deep learning model is to find the optimal parameters of the transformation functions
+that generate good features for specific tasks. The goodness of a set of parameters is measured by
+a loss function, e.g., [Cross-Entropy Loss]. Since loss functions are usually non-linear and non-convex,
+it is difficult to get a closed-form solution. Normally, people use the SGD algorithm, which randomly
+initializes the parameters and then iteratively updates them to reduce the loss.
+
+### Uniform model representation
+
+Many deep learning models have been proposed. Figure 1 categorizes popular deep learning models
+based on their layer connections. The NeuralNet abstraction of SINGA consists of multiple layers
+connected by directed edges. This abstraction is able to represent models from all three categories:
+
+ * For feed-forward models, the connections are already directed.
+
+ * For RNN models, we unroll them into directed connections, as shown in Figure 2.
+
+ * For the undirected connections in RBM, DBM, etc., we replace each undirected connection
+ with two directed connections, as shown in Figure 3.
+
+Specifically, the NeuralNet class is defined in [neuralnet.h]:
+
+ ...
+ vector<Layer*> layers_;
+ ...
+
+The Layer class is defined in [base_layer.h]:
+
+ vector<Layer*> srclayers_, dstlayers_;
+    LayerProto layer_proto_;  // layer configuration, including meta info, e.g., name
+ ...
+
+
+The connections to other layers are kept in `srclayers_` and `dstlayers_`.
+Since there are many different feature transformations, there are correspondingly many different
+[Layer implementations]. Layers whose feature transformation functions involve parameters
+hold Param instances in the layer class, e.g.,
+
+ Param weight;
+
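+For instance, a fully-connected (inner-product) layer would typically hold a weight matrix and a
+bias vector as two Param members. The sketch below is illustrative only; member names and the exact
+class layout of the built-in layers may differ (see [Layer implementations]):
+
+    // illustrative sketch only: a layer holding trainable parameters
+    class InnerProductLayer : public Layer {
+     public:
+      // feature transformation methods omitted
+     private:
+      Param weight_;  // weight matrix of the inner-product transformation
+      Param bias_;    // bias vector added after the matrix multiplication
+    };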
+
+### Configure the structure of a NeuralNet instance
+
+To train a deep learning model, the first step is to write the configuration for the
+model structure, i.e., the layers and connections of the NeuralNet. Like Caffe, we use
+the [Google Protocol Buffer] to define the configuration schema. The NetProto message specifies
+the configuration fields for a NeuralNet instance:
+
+    message NetProto {
+      repeated LayerProto layer = 1;
+      ...
+    }
+
+The configuration is then a sequence of layer configurations:
+
+ layer {
+ // layer configuration
+ }
+ layer {
+ // layer configuration
+ }
+ ...
+
+To configure the model structure, we just configure each layer involved in the model.
+
+ message LayerProto {
+ // the layer name used for identification
+ required string name = 1;
+ // source layer names
+ repeated string srclayers = 3;
+ // parameters, e.g., weight matrix or bias vector
+ repeated ParamProto param = 12;
+ // the layer type from the enum above
+ required LayerType type = 20;
+ // configuration for convolution layer
+ optional ConvolutionProto convolution_conf = 30;
+ // configuration for concatenation layer
+ optional ConcateProto concate_conf = 31;
+ // configuration for dropout layer
+ optional DropoutProto dropout_conf = 33;
...
}
-Users can create a plain text file and fill it with the configurations. SINGA
-parses it according to user provided path.
+A sample configuration for a feed-forward model looks like:
-#### Param
+ layer {
+ name : "data"
+ type : kDataShard
+ }
+ layer {
+ name : "image"
+ type : kImageParser
+ srclayers : "data"
+ }
+ layer {
+ name : "conv"
+ type : kConvolution
+ srclayers : "image"
+ param {
+ // configuration for parameter
+ }
+      convolution_conf {
+ // configuration for convolution operations
+ }
+ ...
+ }
+
+The layer type list is defined in [model.proto]. One type (kFoo) corresponds to one child
+class of Layer (FooLayer) and one configuration field (foo_conf). SINGA infers the `dstlayers_`
+of each layer after reading the configurations of all layers. Developers can implement new layers
+and add them to the type list, so that users can configure them; a sketch of such a layer is shown
+below. [layer] describes the configurations of the current built-in layers.
+
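+The following is only a sketch of what a new layer might look like, using the hypothetical
+kFoo/FooLayer/foo_conf names from the convention above. The two virtual functions are the ones
+described in the Worker section below, and the comment bodies stand for the actual transformation code:
+
+    // hypothetical FooLayer following the kFoo / FooLayer / foo_conf convention
+    class FooLayer : public Layer {
+     public:
+      void ComputeFeature(Phase phase, Metric* perf) override {
+        // read settings from layer_proto_.foo_conf() and transform the features
+        // of srclayers_ into this layer's feature
+      }
+      void ComputeGradient(Phase phase) override {
+        // compute gradients for this layer's Param objects and for srclayers_
+      }
+    };
+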
+Figure 4 shows the model structure corresponding to the neural network configuration
+in [cifar10/model.conf].
+
+
+## Worker
-The Param class is shown below. Users do not need to extend the Param class for
-most cases. We make it a base class just for future extension. For example,
-if a new initialization trick is proposed in the future, we can override the `Init`
-method to implement it.
-
- Param{
- /**
- * Set properties of the parameter.
- * @param conf user defined parameter configuration of type ParamProto
- * @param shape shape of the parameter
- Setup(conf, shape);
- /**
- * Initialize the data of the parameter.
- /
- Init();
-      ...// methods to handle synchronizations with parameter servers and other workers
+At the beginning, the Worker initializes the values of the Param instances of each layer, either randomly
+(according to the user-configured distribution) or by loading them from a [checkpoint file].
+In each training iteration, the Worker visits the layers of the neural network to compute the gradients of
+the Param instances of each layer. Corresponding to the three categories of models, there are three
+different algorithms to compute the gradients of a neural network:
+
+ 1. Back-propagation (BP) for feed-forward models
+ 2. Back-propagation through time (BPTT) for recurrent neural networks
+ 3. Contrastive divergence (CD) for models like RBM and DBM.
+
+SINGA provides these three algorithms as three Worker implementations. Users only need to specify
+in the model.conf file which algorithm should be used. The configuration protocol is
+
+ message ModelProto {
+ ...
+ enum GradCalcAlg {
+ // BP algorithm for feed-forward models, e.g., CNN, MLP, RNN
+ kBP = 1;
+ // BPTT for recurrent neural networks
+ kBPTT = 2;
+ // CD algorithm for RBM, DBM etc., models
+ kCd = 3;
+ }
+ // gradient calculation algorithm
+      required GradCalcAlg alg = 8 [default = kBP];
+ ...
}
-#### Updater
+These algorithms override the TrainOneBatch function of the Worker; e.g., BPWorker
+implements it as
+
+ void BPWorker::TrainOneBatch(int step, Metric* perf) {
+ Forward(step, kTrain, train_net_, perf);
+ Backward(step, train_net_);
+ }
-There are many SGD extensions for updating parameters,
+The Forward function passes the raw input features of one mini-batch through all layers, and the Backward
+function visits the layers in reverse order to compute the gradients of the loss w.r.t. each layer's feature
+and each layer's Param objects. Different algorithms visit the layers in different orders, and some
+traverse the neural network multiple times, e.g., CDWorker's TrainOneBatch function is:
+
+    void CDWorker::TrainOneBatch(int step, Metric* perf) {
+      PositivePhase(step, kTrain, train_net_, perf);
+      NegativePhase(step, kTrain, train_net_, perf);
+      GradientPhase(step, train_net_);
+    }
+
+But all algorithms finally call the same two functions of the Layer class:
+
+ /**
+ * Transform features from connected layers into features of this layer.
+ *
+ * @param phase kTrain, kTest, kPositive, etc.
+ */
+ virtual void ComputeFeature(Phase phase, Metric* perf) = 0;
+ /**
+ * Compute gradients for parameters (and connected layers).
+ *
+ * @param phase kTrain, kTest, kPositive, etc.
+ */
+ virtual void ComputeGradient(Phase phase) = 0;
+
+All Layer implementations must implement the above two functions.
+
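+
+As a concrete illustration, a BP-style traversal could be sketched as below. This is only a sketch:
+the real Forward and Backward functions take additional arguments (e.g., the training step) and handle
+partitioned networks and parameter synchronization, and the `layers()` accessor is assumed here to
+return the `layers_` vector in topological order.
+
+    // sketch of how a BP-style Worker visits layers
+    void Forward(Phase phase, NeuralNet* net, Metric* perf) {
+      for (Layer* layer : net->layers())      // layers in topological order
+        layer->ComputeFeature(phase, perf);   // compute this layer's feature
+    }
+
+    void Backward(Phase phase, NeuralNet* net) {
+      const vector<Layer*>& layers = net->layers();
+      for (auto it = layers.rbegin(); it != layers.rend(); ++it)
+        (*it)->ComputeGradient(phase);        // gradients w.r.t. features and Params
+    }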
+
+## Updater
+
+Once the gradients of parameters are computed, the Updater will update parameter values.
+There are many SGD variants for updating parameters,
like [AdaDelta](http://arxiv.org/pdf/1212.5701v1.pdf),
[AdaGrad](http://www.magicbroom.info/Papers/DuchiHaSi10.pdf),
[RMSProp](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf),
[Nesterov](http://scholar.google.com/citations?view_op=view_citation&hl=en&user=DJ8Ep8YAAAAJ&citation_for_view=DJ8Ep8YAAAAJ:hkOj_22Ku90C)
-and SGD with momentum. We provide a base Updater to deal with these algorithms.
-New parameter updating algorithms can be added by extending the base Updater.
-
- Updater{
- /**
- * @param proto user configuration for the updater.
- Init(conf);
- /**
- * Update parameter based on its gradient
- * @param step training step
- * @param param the Param object
- */
- Update(step, param);
+and SGD with momentum. The core function of the Updater is
+
+ /**
+ * Update parameter values based on gradients
+ * @param step training step
+ * @param param pointer to the Param object
+ * @param grad_scale scaling factor for the gradients
+ */
+ void Update(int step, Param* param, float grad_scale=1.0f);
+ /**
+ * @param step training step
+ * @return the learning rate for this step
+ */
+ float GetLearningRate(int step);
+
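+A plain SGD update (without momentum or weight decay) could then be sketched as below. The Param
+accessors used here (`size()`, `mutable_cpu_data()`, `cpu_grad()`) are illustrative assumptions,
+not the exact SINGA API:
+
+    // sketch of a vanilla SGD update using the Updater interface above
+    void SGDUpdater::Update(int step, Param* param, float grad_scale) {
+      float lr = GetLearningRate(step);         // learning rate for this step
+      float* data = param->mutable_cpu_data();  // parameter values
+      const float* grad = param->cpu_grad();    // gradients computed by the Worker
+      for (int i = 0; i < param->size(); ++i)
+        data[i] -= lr * grad_scale * grad[i];   // w <- w - lr * scale * g
+    }
+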
+SINGA provides several built-in updaters and learning rate change methods. Users can configure them
+according to the [UpdaterProto]:
+
+    message UpdaterProto {
+      enum UpdaterType {
+        // normal SGD with momentum and weight decay
+        kSGD = 1;
+        // adaptive subgradient, http://www.magicbroom.info/Papers/DuchiHaSi10.pdf
+        kAdaGrad = 2;
+        // http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
+        kRMSProp = 3;
+        // Nesterov first optimal gradient method
+        kNesterov = 4;
+      }
+      // updater type
+      required UpdaterType type = 1 [default = kSGD];
+      // configuration for RMSProp algorithm
+      optional RMSPropProto rmsprop_conf = 50;
+
+      enum ChangeMethod {
+        kFixed = 0;
+        kInverseT = 1;
+        kInverse = 2;
+        kExponential = 3;
+        kLinear = 4;
+        kStep = 5;
+        kFixedStep = 6;
+      }
+      // change method for learning rate
+      required ChangeMethod lr_change = 2 [default = kFixed];
+
+      optional FixedStepProto fixedstep_conf = 40;
+      ...
+      optional float momentum = 31 [default = 0];
+      optional float weight_decay = 32 [default = 0];
+      // base learning rate
+      optional float base_lr = 34 [default = 0];
}
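+
+For example, an updater section of model.conf using plain SGD might look as follows. The field
+values are illustrative only, and the name of the enclosing field (`updater`) is an assumption;
+check ModelProto in [model.proto] for the actual field:
+
+    updater {
+      type: kSGD
+      base_lr: 0.01
+      momentum: 0.9
+      weight_decay: 0.0005
+      lr_change: kFixed
+    }
+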
-### Examples
-The [MLP example](..)
-shows how to configure the model through google protocol buffer.
+## Other model configuration fields
+
+Some other important configuration fields for training a deep learning model are listed below:
+
+    // model name, e.g., "cifar10-dcnn", "mnist-mlp"
+    required string name = 1;
+    // frequency of displaying training info
+    required int32 display_frequency = 3;
+    // total num of steps for training
+    required int32 train_steps = 5;
+    ... // step, frequency for validation and test
+    // frequency of checkpoint
+    optional int32 checkpoint_frequency = 34 [default = 0];
+    // whether to resume training from a checkpoint
+    optional bool resume = 36 [default = false];
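+
+For example, the top level of a model.conf file might contain (values are illustrative only):
+
+    name: "mnist-mlp"
+    train_steps: 1000
+    display_frequency: 50
+    checkpoint_frequency: 100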
+
+The pages on [checkpoint and restore] and [validation and test] have more details on the related fields.