Modified: incubator/singa/site/trunk/content/markdown/docs/rnn.md
URL: http://svn.apache.org/viewvc/incubator/singa/site/trunk/content/markdown/docs/rnn.md?rev=1700722&r1=1700721&r2=1700722&view=diff
==============================================================================
--- incubator/singa/site/trunk/content/markdown/docs/rnn.md (original)
+++ incubator/singa/site/trunk/content/markdown/docs/rnn.md Wed Sep 2 07:59:20 2015
@@ -1,19 +1,85 @@
-## Recurrent neural networks (RNN)
---
layout: post
title: Example --- Recurrent Neural Network
category : docs
tags : [rnn, example]
---
{% include JB/setup %}

-Example files for RNN can be found in "SINGA_ROOT/examples/rnnlm", which we assume to be WORKSPACE.
-### Create DataShard

Recurrent Neural Networks (RNN) are widely used for modeling sequential data,
such as music, videos and sentences. In this example, we use SINGA to train the
[RNN model](http://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf)
proposed by Tomas Mikolov for [language modeling](https://en.wikipedia.org/wiki/Language_model).
The training objective (loss) is to minimize the
[perplexity per word](https://en.wikipedia.org/wiki/Perplexity), which is
equivalent to maximizing the probability of predicting the next word given the
current word in a sentence.

-(a) Define your own record. Please refer to [Data Preparation][1] for details.

Different from the [CNN](http://singa.incubator.apache.org/docs/cnn),
[MLP](http://singa.incubator.apache.org/docs/mlp) and
[RBM](http://singa.incubator.apache.org/docs/rbm) examples, which use built-in
[Layer](http://singa.incubator.apache.org/docs/layer)s and
[Record](http://singa.incubator.apache.org/docs/data)s, none of the layers in
this model is built-in. Hence this page also gives examples of implementing
your own Layer subclasses and data Records.

-Records for RNN example are defined in "user.proto" as an extension.

## Running instructions

Scripts for running the training job are provided in *SINGA_ROOT/examples/rnn/*.
First, prepare the data by

    $ cp Makefile.example Makefile
    $ make download
    $ make create

Second, start the training by passing the job configuration:

    # in SINGA_ROOT
    $ ./bin/singa-run.sh -conf SINGA_ROOT/examples/rnn/job.conf

## Implementations

<img src="http://singa.incubator.apache.org/assets/image/rnn-refine.png" align="center" width="300px"/>
<span><strong>Figure 1 - Net structure of the RNN model.</strong></span>

The neural net structure is shown in Figure 1.
Word records are loaded by `RnnlmDataLayer` from `WordShard`. `RnnlmWordparserLayer`
parses the word records to get word indexes (in the vocabulary). In every iteration,
`window_size` words are processed. `RnnlmWordinputLayer` looks up a word
embedding matrix to extract feature vectors for the words in the window.
These features are transformed by `RnnlmInnerproductLayer` and `RnnlmSigmoidLayer`.
`RnnlmSigmoidLayer` is a recurrent layer that forwards features from previous words
to subsequent words. Finally, `RnnlmComputationLayer` computes the perplexity loss
using the word class information from `RnnlmClassparserLayer`. A word class is a
cluster ID: words are clustered based on their frequency in the dataset, so that
words of similar frequency fall into the same cluster. Clustering improves the
efficiency of the final prediction step.

### Data preparation

We use a small dataset in this example.
In this dataset, [dataset description, e.g., format].
The subsequent steps follow the instructions in
[Data Preparation](http://singa.incubator.apache.org/docs/data) to convert the
raw data into `Record`s and insert them into `DataShard`s.

#### Download source data

    # in SINGA_ROOT/examples/rnn/
    wget http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz
    xxx

#### Define your own record

Since this dataset has a different format from the built-in `SingleLabelImageRecord`,
we need to extend the base `Record` to add new fields,

    # in SINGA_ROOT/examples/rnn/user.proto
    package singa;
-    import "common.proto"; // Record message for SINGA is defined
-    import "job.proto"; // Layer message for SINGA is defined
    import "common.proto";  // import SINGA Record
-    extend Record {
    extend Record {  // extend base Record to include users' records
      optional WordClassRecord wordclass = 101;
      optional SingleWordRecord singleword = 102;
    }
@@ -30,24 +96,94 @@ Records for RNN example are defined in "
      optional int32 class_index = 3;  // the index of the class corresponding to this word
    }

-(b) Download raw data
-This example downloads rnnlm-0.4b from [www.rnnlm.org][2] by a command

#### Create data shard for training and testing

{% comment %}
As the vocabulary size is very large, the original perplexity calculation method
is time consuming, because it has to calculate the probabilities of all possible
words for

- make download

    p(wt | w0, w1, ..., wt-1).

-The raw data is stored in a folder "rnnlm-0.4b/train" and "rnnlm-0.4b/test".
-(c) Create data shard for training and testing

Tomas proposed to divide all words into different classes according to the word
frequency, and compute the perplexity according to

-Data shards (e.g., "shard.dat") will be created in "rnnlm_class_shard", "rnnlm_vocab_shard", "rnnlm_word_shard_train" and "rnnlm_word_shard_test" by a command

    p(wt | w0, w1, ..., wt-1) = p(c | w0, w1, ..., wt-1) * p(wt | c)

where `c` is the word class and `w0, w1, ..., wt-1` are the words preceding `wt`.
The probabilities on the right side can be computed faster than

[Makefile](https://github.com/kaiping/incubator-singa/blob/rnnlm/examples/rnnlm/Makefile)
for creating the shards (see
[create_shard.cc](https://github.com/kaiping/incubator-singa/blob/rnnlm/examples/rnnlm/create_shard.cc)),
we need to specify where to download the source data, the number of classes we
want to divide all occurring words into, and all the shards together with
their names and the directories we want to create.
{% endcomment %}

*SINGA_ROOT/examples/rnn/create_shard.cc* defines the following function for creating data shards,

    void create_shard(const char *input, int nclass) {

`input` is the path to the source text file (training, test or validation data) and
`nclass` is the user-specified number of word classes. This function starts with

    using StrIntMap = std::map<std::string, int>;
    StrIntMap wordIdxMap;       // maps a word string to a word index
    StrIntMap wordClassIdxMap;  // maps a word string to a word class index
    if (-1 == nclass) {
      loadClusterForNonTrainMode(input, nclass, &wordIdxMap, &wordClassIdxMap);  // non-training phase
    } else {
      doClusterForTrainMode(input, nclass, &wordIdxMap, &wordClassIdxMap);  // training phase
    }

* If `nclass != -1`, `input` points to the training data file and `doClusterForTrainMode`
  reads all the words in the file to create the two maps. [The two maps are stored in xxx]
* Otherwise (`nclass == -1`), `input` points to either the test or the validation data
  file, and `loadClusterForNonTrainMode` loads the two maps from [xxx].

A simplified sketch of the clustering step is given below.
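The clustering inside `doClusterForTrainMode` is not shown above. The following is a
minimal, self-contained sketch of a frequency-based clustering step, assuming that
words are sorted by descending frequency and that classes are chosen to cover
roughly equal frequency mass; the function name, signature and splitting rule are
illustrative and do not reproduce the exact code in *create_shard.cc*.

    // Hypothetical sketch: cluster words into `nclass` classes by frequency.
    #include <algorithm>
    #include <fstream>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    using StrIntMap = std::map<std::string, int>;

    void DoClusterForTrainModeSketch(const char* input, int nclass,
                                     StrIntMap* wordIdxMap,
                                     StrIntMap* wordClassIdxMap) {
      // 1. Count word frequencies over the training file.
      std::map<std::string, int> freq;
      std::ifstream in(input);
      std::string word;
      while (in >> word) ++freq[word];

      // 2. Sort words by descending frequency and assign word indexes.
      std::vector<std::pair<std::string, int>> sorted(freq.begin(), freq.end());
      std::sort(sorted.begin(), sorted.end(),
                [](const auto& a, const auto& b) { return a.second > b.second; });
      for (size_t i = 0; i < sorted.size(); ++i)
        (*wordIdxMap)[sorted[i].first] = static_cast<int>(i);

      // 3. Split the sorted list into `nclass` groups so that each class
      //    covers roughly the same total frequency mass (an assumption of
      //    this sketch, not necessarily the rule used by the example).
      long long total = 0;
      for (const auto& p : sorted) total += p.second;
      long long acc = 0;
      for (const auto& p : sorted) {
        long long cls = acc * nclass / std::max<long long>(total, 1);
        (*wordClassIdxMap)[p.first] =
            static_cast<int>(std::min<long long>(nclass - 1, cls));
        acc += p.second;
      }
    }

With such a split, each class covers a manageable number of candidate words, which
is what makes the factorized prediction step in `RnnlmComputationLayer` cheaper.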
Words from the training/test/validation files are converted into `Record`s by

    singa::SingleWordRecord *wordRecord = record.MutableExtension(singa::singleword);
    while (in >> word) {
      wordRecord->set_word(word);
      wordRecord->set_word_index(wordIdxMap[word]);
      wordRecord->set_class_index(wordClassIdxMap[word]);
      snprintf(key, kMaxKeyLength, "%08d", wordIdxMap[word]);
      wordShard.Insert(std::string(key), record);
    }
    }

Compilation and running commands are provided in *Makefile.example*.
After executing `make create`, three data shards will be created by
`create_shard.cc`, namely *rnnlm_word_shard_train*, *rnnlm_word_shard_test*
and *rnnlm_word_shard_valid*.

-### Define Layers

### Layer implementation

-Similar to records, layers are also defined in "user.proto" as an extension.

7 layers (i.e., Layer subclasses) are implemented for this application,
including 1 [data layer](http://singa.incubator.apache.org/docs/layer#data-layers) which fetches data records from data
shards, 2 [parser layers](http://singa.incubator.apache.org/docs/layer#parser-layers) which parse the input records, 3 neuron layers
which transform the word features, and 1 loss layer which computes the
objective loss.

The data shards and their creation have been illustrated above. Next, we
discuss the configuration and functionality of each layer. Finally, we
introduce how to configure a job and run the training for your own model.

Following the guide for implementing [new Layer subclasses](http://singa.incubator.apache.org/docs/layer#implementing-a-new-layer-subclass),
we extend the [LayerProto](http://singa.incubator.apache.org/api/classsinga_1_1LayerProto.html)
to include the configuration message of each user-defined layer as shown below
(5 out of the 7 layers have specific configurations),

    package singa;
@@ -63,61 +199,368 @@ Similar to records, layers are also defi
      optional RnnlmDataProto rnnlmdata_conf = 207;
    }

- // 1-Message that stores parameters used by RnnlmComputationLayer
- message RnnlmComputationProto {
- optional bool bias_term = 1 [default = true]; // use bias vector or not

In the subsequent sections, we describe the implementation of each layer,
including its configuration message.

### RnnlmDataLayer

It inherits [DataLayer](/api/classsinga_1_1DataLayer.html) and loads word and
class `Record`s from `DataShard`s into memory.

#### Functionality

    void RnnlmDataLayer::Setup() {
      Read records from ClassShard to construct the mapping from word string to class index;
      Resize records_ to length window_size + 1;
      Read the 1st word record into the last position;
    }

- // 2-Message that stores parameters used by RnnlmSigmoidLayer
- message RnnlmSigmoidProto {
- optional bool bias_term = 1 [default = true]; // use bias vector or not

    void RnnlmDataLayer::ComputeFeature() {
      records_[0] = records_[windowsize_];  // copy the last record to the 1st position in the record vector
      Assign values to records_;            // read window_size new word records from WordShard
    }

- // 3-Message that stores parameters used by RnnlmInnerproductLayer
- message RnnlmInnerproductProto {
- required int32 num_output = 1; // number of outputs for the layer
- optional bool bias_term = 30 [default = true]; // use bias vector or not

The `Setup` function loads the mapping (from word string to class index) from
*ClassShard*.

Every time the `ComputeFeature` function is called, it loads `windowsize_` records
from `WordShard`. For consistency of the operations at each training iteration,
the layer maintains a record vector of length `window_size + 1`: `Setup` reads the
first word record from the *WordShard* into the last position of this vector, and
each call to `ComputeFeature` then copies the last record to the first position
before reading `window_size` new records. A standalone sketch of this sliding
window is given below.
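To make the window maintenance concrete, here is a small, self-contained sketch of
the buffer logic described above; it uses plain `int` word indexes instead of
`Record`s, and `ReadNextWordIndex()` is a stand-in for reading the next record from
the *WordShard*.

    #include <vector>

    // Stand-in for reading the next word record from the WordShard;
    // here it just returns increasing indexes.
    int ReadNextWordIndex() {
      static int next = 0;
      return next++;
    }

    class WindowBuffer {
     public:
      explicit WindowBuffer(int window_size)
          : window_size_(window_size), records_(window_size + 1, 0) {
        // Setup: read the 1st record into the last position of the vector.
        records_[window_size_] = ReadNextWordIndex();
      }

      // One call per iteration, mirroring RnnlmDataLayer::ComputeFeature().
      void NextWindow() {
        records_[0] = records_[window_size_];    // carry the last word over
        for (int i = 1; i <= window_size_; ++i)  // read window_size new words
          records_[i] = ReadNextWordIndex();
      }

      const std::vector<int>& records() const { return records_; }

     private:
      int window_size_;
      std::vector<int> records_;
    };

Because the last word of one window is carried over as the first word of the next
window, consecutive windows overlap by exactly one word, which keeps the
prediction targets contiguous across iterations.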
#### Configuration

    message RnnlmDataProto {
      required string class_path = 1;   // path to the class data file/folder, absolute or relative to the workspace
      required string word_path = 2;    // path to the word data file/folder, absolute or relative to the workspace
      required int32 window_size = 3;   // window size
    }

- // 4-Message that stores parameters used by RnnlmWordinputLayer

There are two paths: `class_path` points to the class data created above, and
`word_path` points to the word data (e.g., the training word shard for the
training phase). `window_size` is set to the number of words processed in each
iteration.

### RnnlmWordParserLayer

This layer gets `window_size` word strings from `RnnlmDataLayer` and looks each
one up in the word-string-to-word-index map to get its word index.

#### Functionality

    void RnnlmWordparserLayer::Setup(){
      Obtain window size from src layer;
      Obtain vocabulary size from src layer;
      Reshape data_ as {window_size};
    }

    void RnnlmWordparserLayer::ParseRecords(Blob* blob){
      For each word record in the window, get its word index and insert the index into blob;
    }

#### Configuration

This layer does not have specific configuration fields.

### RnnlmClassParserLayer

It maps each word in the processing window to its class index.

#### Functionality

    void RnnlmClassparserLayer::Setup(){
      Obtain window size from src layer;
      Obtain vocabulary size from src layer;
      Obtain class size from src layer;
      Reshape data_ as {windowsize_, 4};
    }

    void RnnlmClassparserLayer::ParseRecords(){
      for(int i = 1; i < records.size(); i++){
        Copy the starting word index of this word's class to data[i]'s 1st position;
        Copy the ending word index of this word's class to data[i]'s 2nd position;
        Copy the index of the input word to data[i]'s 3rd position;
        Copy the class index of the input word to data[i]'s 4th position;
      }
    }

The `Setup` function obtains the window size, vocabulary size and class size
from the source layer and reshapes the data blob to `{windowsize_, 4}`.

#### Configuration

This layer fetches the class information (the mapping between classes and
words) from RnnlmDataLayer and maintains it as data in this layer.

Next, this layer parses the last `window_size` word records from
RnnlmDataLayer and stores them as data. Then, it retrieves the corresponding
class for each input word and stores the starting word index of this class,
the ending word index of this class, the word index and the class index,
respectively. A sketch of the resulting 4-column layout is shown below.
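As an illustration of that layout, the following sketch fills one row of the
`{window_size, 4}` blob; the `ClassRange` struct and the float-typed blob are
assumptions made for this example only.

    #include <vector>

    // Hypothetical summary of the class information kept by the layer:
    // for each class, the [start, end] range of word indexes it covers.
    struct ClassRange {
      int start;  // index of the first word in this class
      int end;    // index of the last word in this class
    };

    // Fill one row (4 values) of the {window_size, 4} data blob for a single word.
    void FillRow(float* row, int word_idx, int class_idx,
                 const std::vector<ClassRange>& classes) {
      row[0] = static_cast<float>(classes[class_idx].start);  // 1st: class start word index
      row[1] = static_cast<float>(classes[class_idx].end);    // 2nd: class end word index
      row[2] = static_cast<float>(word_idx);                  // 3rd: word index
      row[3] = static_cast<float>(class_idx);                 // 4th: class index
    }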
### RnnlmWordInputLayer

Using the input word records, this layer obtains the corresponding word
(embedding) vectors as its data and passes them to the RnnlmInnerProductLayer
above for further processing.

#### Configuration

The length of each word vector needs to be configured, as does whether to use a
bias term during training (see
[job.proto](https://github.com/kaiping/incubator-singa/blob/rnnlm/src/proto/job.proto) for details).

    message RnnlmWordinputProto {
      required int32 word_length = 1;  // vector length for each input word
      optional bool bias_term = 30 [default = true];  // use bias vector or not
    }

- // 5-Message that stores parameters used by RnnlmWordparserLayer - nothing needs to be configured
- //message RnnlmWordparserProto {
- //}
-
- // 6-Message that stores parameters used by RnnlmClassparserLayer - nothing needs to be configured
- //message RnnlmClassparserProto {
- //}

#### Functionality

In the setup phase, this layer reshapes its members, such as the `data`, `grad`
and `weight` matrices, and obtains the vocabulary size from its source layer
(i.e., RnnlmWordParserLayer).

In the forward phase, the `window_size` input word indices are used to select
`window_size` word vectors from this layer's weight matrix, one row per word
index.

    void RnnlmWordinputLayer::ComputeFeature() {
      for(int t = 0; t < windowsize_; t++){
        data[t] = weight[src[t]];
      }
    }

- // 7-Message that stores parameters used by RnnlmDataLayer
- message RnnlmDataProto {
- required string class_path = 1; // path to the data file/folder, absolute or relative to the workspace
- required string word_path = 2;
- required int32 window_size = 3; // window size.

In the backward phase, after this layer's gradient has been computed by its
destination layer (i.e., RnnlmInnerProductLayer), the gradient of the weight
matrix is filled row by row (the rows corresponding to the input word indices)
from this layer's gradient.

    void RnnlmWordinputLayer::ComputeGradient() {
      for(int t = 0; t < windowsize_; t++){
        gweight[src[t]] = grad[t];
      }
    }

-### Configure Job

### RnnlmInnerProductLayer

This is a neuron layer which receives the data from RnnlmWordInputLayer and
sends the computation results to RnnlmSigmoidLayer.

#### Configuration

The number of outputs (neurons) needs to be specified, as well as whether to
use a bias term.

    message RnnlmInnerproductProto {
      required int32 num_output = 1;  // number of outputs for the layer
      optional bool bias_term = 30 [default = true];  // use bias vector or not
    }

#### Functionality

In the forward phase, this layer computes the dot product between the data of
its source layer (i.e., RnnlmWordInputLayer) and its weight matrix.

    void RnnlmInnerproductLayer::ComputeFeature() {
      data = dot(src, weight);  // dot product operation
    }

In the backward phase, this layer first computes the gradient of its source
layer (i.e., RnnlmWordInputLayer), and then computes the gradient of its own
weight matrix by aggregating the results of all timestamps:

    void RnnlmInnerproductLayer::ComputeGradient() {
      for (int t = 0; t < windowsize_; t++) {
        Add the dot product of src[t] and grad[t] to gweight;
      }
      Copy the dot product of grad and weight to gsrc;
    }

### RnnlmSigmoidLayer

This is a recurrent neuron layer. For each timestamp, the corresponding
component of its data uses the previous timestamp's data component as part of
its input. This is how the time-order information is exploited in this language
model.

If you want to implement a recurrent neural network following our design, this
is the layer to refer to; other designs for making use of information from past
timestamps are of course possible. The recurrence this layer implements is
sketched below.
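Below is a minimal, self-contained sketch of that recurrence, ignoring the bias
term and using plain vectors instead of SINGA blobs; the function and variable
names are illustrative and do not mirror the layer's actual members.

    #include <cmath>
    #include <vector>

    using Vec = std::vector<float>;
    using Mat = std::vector<Vec>;  // row-major: W[i][j]

    float Sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

    // data[t] = sigmoid(src[t] + data[t-1] * W), for t = 0 .. window_size-1.
    // src holds the outputs of RnnlmInnerProductLayer for each position.
    void Recurrence(const std::vector<Vec>& src, const Mat& W,
                    std::vector<Vec>* data) {
      int window_size = static_cast<int>(src.size());
      int dim = static_cast<int>(src[0].size());
      data->assign(window_size, Vec(dim, 0.0f));
      for (int t = 0; t < window_size; ++t) {
        for (int j = 0; j < dim; ++j) {
          float pre = src[t][j];
          if (t > 0) {  // add the contribution from the previous timestamp
            for (int i = 0; i < dim; ++i) pre += (*data)[t - 1][i] * W[i][j];
          }
          (*data)[t][j] = Sigmoid(pre);
        }
      }
    }

Note how position `t` only ever reads position `t - 1`; this dependency is what
forces the timestamps to be processed in order.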
#### Configuration

Whether to use a bias term needs to be specified.

    message RnnlmSigmoidProto {
      optional bool bias_term = 1 [default = true];  // use bias vector or not
    }

#### Functionality

In the forward phase, this layer first receives data from its source layer
(i.e., RnnlmInnerProductLayer), which forms one part of the input. Then, for
each timestamp, it computes the dot product between the previous timestamp's
data and its own weight matrix, which forms the other part. The two parts are
summed and passed through the sigmoid activation.

    void RnnlmSigmoidLayer::ComputeFeature() {
      for(int t = 0; t < window_size; t++){
        if(t == 0) Copy the sigmoid results of src[t] to data[t];
        else Compute the dot product of data[t - 1] and weight, add src[t], and set data[t] to the sigmoid of the sum;
      }
    }

In the backward phase, RnnlmSigmoidLayer first updates its member `grad` using
information from the next timestamp. Then, iterating over the timestamps, it
computes the gradients of its weight matrix and of its source layer
RnnlmInnerProductLayer.

    void RnnlmSigmoidLayer::ComputeGradient(){
      Update grad[t];  // update the gradient of the current layer, adding a term from the next timestamp
      for (int t = 0; t < windowsize_; t++) {
        Update gweight;   // compute the gradient of the weight matrix
        Compute gsrc[t];  // compute the gradient of the src layer
      }
    }

### RnnlmComputationLayer

This layer is the loss layer, in which the performance metrics, i.e., the
probability of predicting the next word correctly and the perplexity (PPL for
short), are computed. The layer is composed of a class-information part and a
word-information part, so the computation can essentially be divided into two
parts by slicing this layer's weight matrix.

#### Configuration

Whether to use a bias term during training needs to be specified.

    message RnnlmComputationProto {
      optional bool bias_term = 1 [default = true];  // use bias vector or not
    }

#### Functionality

In the forward phase, using the two sliced weight matrices (one for the class
information, the other for the words in the predicted class), this
RnnlmComputationLayer calculates the dot product between the source layer's
data and each sliced matrix. The results can be denoted as `y1` and `y2`.
After a softmax function, the probability distribution over classes and over
the words in the class are obtained for each input word; the activated results
can be denoted as `p1` and `p2`. Finally, the PPL value is computed from these
probability distributions (a standalone sketch of this factorized computation
follows the pseudo code below).

    void RnnlmComputationLayer::ComputeFeature() {
      Compute y1 and y2;
      p1 = Softmax(y1);
      p2 = Softmax(y2);
      Compute the perplexity value PPL;
    }
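The following self-contained sketch shows how the factorized probability
`p(wt) = p(class | h) * p(wt | class, h)` and the perplexity could be computed
from the two softmax outputs `p1` and `p2`; the data layout and helper names are
assumptions for illustration, not the layer's real members.

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Softmax over a plain vector of scores.
    std::vector<double> Softmax(const std::vector<double>& y) {
      double max = y[0];
      for (double v : y) max = std::max(max, v);
      double sum = 0.0;
      std::vector<double> p(y.size());
      for (size_t i = 0; i < y.size(); ++i) sum += (p[i] = std::exp(y[i] - max));
      for (double& v : p) v /= sum;
      return p;
    }

    // Probability of the target word under the class factorization.
    // y1 holds the class scores; y2 holds the scores of the words inside the
    // target class, and word_offset is the target word's position in that class.
    double WordProbability(const std::vector<double>& y1, int class_idx,
                           const std::vector<double>& y2, int word_offset) {
      std::vector<double> p1 = Softmax(y1);  // distribution over classes
      std::vector<double> p2 = Softmax(y2);  // distribution over words in class
      return p1[class_idx] * p2[word_offset];
    }

    // Perplexity over a window: exp of the average negative log-probability.
    double Perplexity(const std::vector<double>& word_probs) {
      double sum_log = 0.0;
      for (double p : word_probs) sum_log += std::log(p);
      return std::exp(-sum_log / word_probs.size());
    }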
In the backward phase, this layer performs three computations. First, it
computes its own gradient for each timestamp. Second, it computes the gradient
of its weight matrix by aggregating the results of all timestamps. Third, it
computes the gradient of its source layer, RnnlmSigmoidLayer, timestamp by
timestamp.

    void RnnlmComputationLayer::ComputeGradient(){
      Compute grad[t] for all timestamps;
      Compute gweight by aggregating the results computed for different timestamps;
      Compute gsrc[t] for all timestamps;
    }

## Updater Configuration

We employ the kFixedStep method for changing the learning rate, using different
learning rate values in different step ranges. See the
[Updater](http://singa.incubator.apache.org/docs/updater) page for more
information about choosing updaters.

    updater {
      # weight_decay: 0.0000001
      lr_change: kFixedStep
      type: kSGD
      fixedstep_conf {
        step: 0
        step: 42810
        step: 49945
        step: 57080
        step: 64215
        step_lr: 0.1
        step_lr: 0.05
        step_lr: 0.025
        step_lr: 0.0125
        step_lr: 0.00625
      }
    }

## TrainOneBatch() Function

We use the BP (back-propagation) algorithm to train the RNN model. The
corresponding configuration is

    # in job.conf file
    alg: kBackPropagation

Refer to [TrainOneBatch](http://singa.incubator.apache.org/docs/train-one-batch)
for more information on the different TrainOneBatch() functions.

## Cluster Configuration

In this RNN language model, we configure the cluster topology as follows.

    cluster {
      nworker_groups: 1
      nserver_groups: 1
      nservers_per_group: 1
      nworkers_per_group: 1
      nservers_per_procs: 1
      nworkers_per_procs: 1
      workspace: "examples/rnnlm/"
    }

This trains the model on a single node. For other configuration choices, please
refer to the [frameworks](http://singa.incubator.apache.org/docs/frameworks) page.

## Configure Job

Job configuration is written in "job.conf".

Note: extended field names should be enclosed in square brackets, e.g.,
`[singa.rnnlmdata_conf]`.

-### Run Training

## Run Training

Start training by the following commands:

    cd SINGA_ROOT
    ./bin/singa-run.sh -workspace=examples/rnnlm

-
-
-
-
- [1]: http://singa.incubator.apache.org/docs/data.html
- [2]: www.rnnlm.org
\ No newline at end of file
Added: incubator/singa/site/trunk/content/markdown/docs/train-one-batch.md
URL: http://svn.apache.org/viewvc/incubator/singa/site/trunk/content/markdown/docs/train-one-batch.md?rev=1700722&view=auto
==============================================================================
--- incubator/singa/site/trunk/content/markdown/docs/train-one-batch.md (added)
+++ incubator/singa/site/trunk/content/markdown/docs/train-one-batch.md Wed Sep 2 07:59:20 2015
@@ -0,0 +1,183 @@
---
layout: post
title: TrainOneBatch
category : docs
tags : [CD, BP]
---
{% include JB/setup %}

For each SGD iteration, every worker calls the `TrainOneBatch` function to
compute gradients of the parameters associated with its local layers (i.e., the
layers dispatched to it). SINGA implements two algorithms for the
`TrainOneBatch` function; users select the appropriate algorithm for their
model in the configuration.

## Basic user guide

### Back-propagation

The [BP algorithm](http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf) is used for
computing gradients of feed-forward models, e.g., [CNN](http://singa.incubator.apache.org/docs/cnn)
and [MLP](http://singa.incubator.apache.org/docs/mlp), and of the [RNN](http://singa.incubator.apache.org/docs/rnn) models in SINGA.

    # in job.conf
    alg: kBP

To use the BP algorithm for the `TrainOneBatch` function, users simply
configure the `alg` field with `kBP`. If a neural net contains user-defined
layers, these layers must be implemented properly to be consistent with the
implementation of the BP algorithm in SINGA (see below).

### Contrastive Divergence

The [CD algorithm](http://www.cs.toronto.edu/~fritz/absps/nccd.pdf) is used for
computing gradients of energy models like RBM.

    # job.conf
    alg: kCD
    cd_conf {
      cd_k: 2
    }

To use the CD algorithm for the `TrainOneBatch` function, users configure the
`alg` field to `kCD`. Users can also configure the number of Gibbs sampling
steps in the CD algorithm through the `cd_k` field. By default, it is set to 1.

## Advanced user guide

### Implementation of BP

The BP algorithm is implemented in SINGA following the pseudo code below,

    BPTrainOnebatch(step, net) {
      // forward propagate
      foreach layer in net.local_layers() {
        if IsBridgeDstLayer(layer)
          recv data from the src layer (i.e., BridgeSrcLayer)
        foreach param in layer.params()
          Collect(param)  // recv response from servers for last update

        layer.ComputeFeature(kForward)

        if IsBridgeSrcLayer(layer)
          send layer.data_ to dst layer
      }
      // backward propagate
      foreach layer in reverse(net.local_layers) {
        if IsBridgeSrcLayer(layer)
          recv gradient from the dst layer (i.e., BridgeDstLayer)
          recv response from servers for last update

        layer.ComputeGradient()
        foreach param in layer.params()
          Update(step, param)  // send param.grad_ to servers

        if IsBridgeDstLayer(layer)
          send layer.grad_ to src layer
      }
    }

It forwards features through all local layers (locality can be checked via the
layer partition ID and worker ID) and propagates gradients backward in the
reverse order.
[BridgeSrcLayer](http://singa.incubator.apache.org/docs/layer/#bridgesrclayer--bridgedstlayer)
(resp. `BridgeDstLayer`) is blocked until the feature (resp. gradient) from the
source (resp. destination) layer arrives. Parameter gradients are sent to
servers via the `Update` function. Updated parameters are collected via the
`Collect` function, which is blocked until the parameter has been updated.
[Param](http://singa.incubator.apache.org/docs/param) objects have versions,
which can be used to check whether a `Param` object has been updated or not.

Since RNN models are unrolled into feed-forward models, users need to implement
the forward propagation in the recurrent layer's `ComputeFeature` function and
the backward propagation in the recurrent layer's `ComputeGradient` function.
As a result, the whole `TrainOneBatch` runs the
[back-propagation through time (BPTT)](https://en.wikipedia.org/wiki/Backpropagation_through_time)
algorithm. A sketch of such a recurrent layer is given below.
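To illustrate what implementing BPTT inside a layer means, here is a toy
recurrent layer whose `ComputeFeature` unrolls the recurrence over a window and
whose `ComputeGradient` propagates the error back through the same window; it
uses a scalar recurrent weight, does not derive from SINGA's real `Layer` class,
and is only a sketch of the idea.

    #include <cmath>
    #include <vector>

    using Vec = std::vector<float>;

    // Minimal recurrent "layer": h[t] = tanh(x[t] + w * h[t-1]) with a scalar
    // recurrent weight w, so the BPTT arithmetic stays easy to follow.
    struct ToyRecurrentLayer {
      float w = 0.5f;   // recurrent weight (a scalar for simplicity)
      Vec x, h;         // inputs and hidden states for one window
      float gw = 0.0f;  // accumulated gradient of w

      // Forward pass over the whole window (the "unrolled" part of BPTT).
      void ComputeFeature(const Vec& input) {
        x = input;
        h.assign(x.size(), 0.0f);
        for (size_t t = 0; t < x.size(); ++t) {
          float prev = (t == 0) ? 0.0f : h[t - 1];
          h[t] = std::tanh(x[t] + w * prev);
        }
      }

      // Backward pass: given dL/dh[t], accumulate dL/dw and return dL/dx[t].
      Vec ComputeGradient(const Vec& grad_h) {
        Vec grad_x(x.size(), 0.0f);
        float carry = 0.0f;  // gradient flowing back through the recurrent link
        gw = 0.0f;
        for (int t = static_cast<int>(x.size()) - 1; t >= 0; --t) {
          float total = grad_h[t] + carry;             // local + future error
          float dpre = total * (1.0f - h[t] * h[t]);   // tanh'(pre) = 1 - h^2
          grad_x[t] = dpre;                            // gradient w.r.t. x[t]
          float prev = (t == 0) ? 0.0f : h[t - 1];
          gw += dpre * prev;                           // gradient w.r.t. w
          carry = dpre * w;                            // pass error to h[t-1]
        }
        return grad_x;
      }
    };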
### Implementation of CD

The CD algorithm is implemented in SINGA following the pseudo code below,

    CDTrainOneBatch(step, net) {
      # positive phase
      foreach layer in net.local_layers()
        if IsBridgeDstLayer(layer)
          recv positive phase data from the src layer (i.e., BridgeSrcLayer)
        foreach param in layer.params()
          Collect(param)  // recv response from servers for last update
        layer.ComputeFeature(kPositive)
        if IsBridgeSrcLayer(layer)
          send positive phase data to dst layer

      # negative phase
      foreach gibbs in [0...layer_proto_.cd_k]
        foreach layer in net.local_layers()
          if IsBridgeDstLayer(layer)
            recv negative phase data from the src layer (i.e., BridgeSrcLayer)
          layer.ComputeFeature(kNegative)
          if IsBridgeSrcLayer(layer)
            send negative phase data to dst layer

      foreach layer in net.local_layers()
        layer.ComputeGradient()
        foreach param in layer.params
          Update(param)
    }

Parameter gradients are computed after the positive phase and negative phase.

### Implementing a new algorithm

SINGA implements BP and CD by creating two subclasses of the
[Worker](api/classsinga_1_1Worker.html) class:
[BPWorker](api/classsinga_1_1BPWorker.html)'s `TrainOneBatch` function implements
the BP algorithm, and [CDWorker](api/classsinga_1_1CDWorker.html)'s
`TrainOneBatch` function implements the CD algorithm. To implement a new
algorithm for the `TrainOneBatch` function, users need to create a new subclass
of `Worker`, e.g.,

    class FooWorker : public Worker {
      void TrainOneBatch(int step, shared_ptr<NeuralNet> net, Metric* perf) override;
      void TestOneBatch(int step, Phase phase, shared_ptr<NeuralNet> net, Metric* perf) override;
    };

`FooWorker` must implement the above two functions for training one mini-batch
and testing one mini-batch, respectively. The `perf` argument is for collecting
training or testing performance, e.g., the objective loss or accuracy. It is
passed to the `ComputeFeature` function of each layer.

Users can define configuration fields for `FooWorker`, e.g.,

    # in user.proto
    message FooWorkerProto {
      optional int32 b = 1;
    }

    extend JobProto {
      optional FooWorkerProto foo_conf = 101;
    }

    # in job.proto
    JobProto {
      ...
      extensions 101 to max;
    }

This is similar to [adding configuration fields for a new layer](http://singa.incubator.apache.org/docs/layer/#implementing-a-new-layer-subclass).

To use `FooWorker`, users need to register it in the [main.cc](http://singa.incubator.apache.org/docs/programming-guide)
and configure the `alg` and `foo_conf` fields,

    # in main.cc
    const int kFoo = 3;  // worker ID, must be different from that of CDWorker and BPWorker
    driver.RegisterWorker<FooWorker>(kFoo);

    # in job.conf
    ...
    alg: 3
    [foo_conf] {
      b: 4
    }

Added: incubator/singa/site/trunk/content/markdown/docs/updater.md
URL: http://svn.apache.org/viewvc/incubator/singa/site/trunk/content/markdown/docs/updater.md?rev=1700722&view=auto
==============================================================================
--- incubator/singa/site/trunk/content/markdown/docs/updater.md (added)
+++ incubator/singa/site/trunk/content/markdown/docs/updater.md Wed Sep 2 07:59:20 2015
@@ -0,0 +1,288 @@
---
layout: post
title: Updater
category : docs
tags : [updater]
---
{% include JB/setup %}

Every server in SINGA has an [Updater](api/classsinga_1_1Updater.html)
instance that updates parameters based on gradients.
In this page, the *Basic user guide* describes the configuration of an updater.
The *Advanced user guide* presents details on how to implement a new updater
and a new learning rate changing method.

## Basic user guide

There are many different parameter updating protocols (i.e., subclasses of
`Updater`). They share some configuration fields:

* `type`, an integer for identifying an updater;
* `learning_rate`, configuration for the
[LRGenerator](http://singa.incubator.apache.org/api/classsinga_1_1LRGenerator.html) which controls the learning rate;
* `weight_decay`, the coefficient for [L2 regularization](http://deeplearning.net/tutorial/gettingstarted.html#regularization);
* [momentum](http://ufldl.stanford.edu/tutorial/supervised/OptimizationStochasticGradientDescent/).

If you are not familiar with the above terms, you can find their meanings in
[this page provided by Karpathy](http://cs231n.github.io/neural-networks-3/#update).

### Configuration of built-in updater classes

#### Updater

The base `Updater` implements the [vanilla SGD algorithm](http://cs231n.github.io/neural-networks-3/#sgd).
Its configuration type is `kSGD`.
Users need to configure at least the `learning_rate` field;
`momentum` and `weight_decay` are optional.

    updater {
      type: kSGD
      momentum: float
      weight_decay: float
      learning_rate {

      }
    }

#### AdaGradUpdater

It inherits the base `Updater` to implement the
[AdaGrad](http://www.magicbroom.info/Papers/DuchiHaSi10.pdf) algorithm.
Its type is `kAdaGrad`.
`AdaGradUpdater` is configured similarly to `Updater` except
that `momentum` is not used.

#### NesterovUpdater

It inherits the base `Updater` to implement the
[Nesterov](http://arxiv.org/pdf/1212.0901v2.pdf) (section 3.5) updating protocol.
Its type is `kNesterov`.
`learning_rate` and `momentum` must be configured; `weight_decay` is optional.

#### RMSPropUpdater

It inherits the base `Updater` to implement the
[RMSProp algorithm](http://cs231n.github.io/neural-networks-3/#sgd) proposed by
[Hinton](http://www.cs.toronto.edu/%7Etijmen/csc321/slides/lecture_slides_lec6.pdf) (slide 29).
Its type is `kRMSProp`.

    updater {
      type: kRMSProp
      rmsprop_conf {
        rho: float  # [0,1]
      }
    }

### Configuration of learning rate

The `learning_rate` field is configured as

    learning_rate {
      type: ChangeMethod
      base_lr: float  # base/initial learning rate
      ...             # fields of a specific changing method
    }

The common fields include `type` and `base_lr`. SINGA provides the following
`ChangeMethod`s.

#### kFixed

The `base_lr` is used for all steps.
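Before turning to the other change methods, the following conceptual sketch shows
how `learning_rate`, `momentum` and `weight_decay` typically combine in one vanilla
SGD step; it illustrates the textbook update rule, not SINGA's actual
`Updater::Update` implementation.

    #include <vector>

    // One vanilla SGD step with momentum and L2 weight decay (textbook form):
    //   v = momentum * v - lr * (grad + weight_decay * w)
    //   w = w + v
    void SgdStep(std::vector<float>* weights, const std::vector<float>& grads,
                 std::vector<float>* velocity, float lr, float momentum,
                 float weight_decay) {
      for (size_t i = 0; i < weights->size(); ++i) {
        float g = grads[i] + weight_decay * (*weights)[i];  // add L2 term
        (*velocity)[i] = momentum * (*velocity)[i] - lr * g;
        (*weights)[i] += (*velocity)[i];
      }
    }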
#### kLinear

The updater should be configured like

    learning_rate {
      base_lr: float
      linear_conf {
        freq: int
        final_lr: float
      }
    }

Linear interpolation is used to change the learning rate,

    lr = (1 - step / freq) * base_lr + (step / freq) * final_lr

#### kExponential

The updater should be configured like

    learning_rate {
      base_lr: float
      exponential_conf {
        freq: int
      }
    }

The learning rate for `step` is

    lr = base_lr / 2^(step / freq)

#### kInverseT

The updater should be configured like

    learning_rate {
      base_lr: float
      inverset_conf {
        final_lr: float
      }
    }

The learning rate for `step` is

    lr = base_lr / (1 + step / final_lr)

#### kInverse

The updater should be configured like

    learning_rate {
      base_lr: float
      inverse_conf {
        gamma: float
        pow: float
      }
    }

The learning rate for `step` is

    lr = base_lr * (1 + gamma * step)^(-pow)

#### kStep

The updater should be configured like

    learning_rate {
      base_lr : float
      step_conf {
        change_freq: int
        gamma: float
      }
    }

The learning rate for `step` is

    lr = base_lr * gamma^(step / change_freq)

#### kFixedStep

The updater should be configured like

    learning_rate {
      fixedstep_conf {
        step: int
        step_lr: float

        step: int
        step_lr: float

        ...
      }
    }

Denote the i-th tuple as (step[i], step_lr[i]). The learning rate for `step` is

    step_lr[k]

where step[k] is the largest configured step that is not larger than `step`.

## Advanced user guide

### Implementing a new Updater subclass

The base Updater class has one virtual function,

    class Updater {
     public:
      virtual void Update(int step, Param* param, float grad_scale = 1.0f) = 0;

     protected:
      UpdaterProto proto_;
      LRGenerator lr_gen_;
    };

It updates the values of `param` based on its gradients. The `step` argument is
for deciding the learning rate, which may change over time (steps).
`grad_scale` scales the original gradient values. This function is called by a
server once it has received all gradients of the same `Param` object.

To implement a new Updater subclass, users must override the `Update` function.

    class FooUpdater : public Updater {
      void Update(int step, Param* param, float grad_scale = 1.0f) override;
    };

Configuration of this new updater can be declared similarly to that of a new
layer,

    # in user.proto
    message FooUpdaterProto {
      optional int32 c = 1;
    }

    extend UpdaterProto {
      optional FooUpdaterProto fooupdater_conf = 101;
    }

The new updater should be registered in the
[main function](http://singa.incubator.apache.org/docs/programming-guide),

    driver.RegisterUpdater<FooUpdater>("FooUpdater");

Users can then configure the job as

    # in job.conf
    updater {
      user_type: "FooUpdater"  # must use user_type with the same string identifier as the one used for registration
      fooupdater_conf {
        c : 20;
      }
    }

### Implementing a new LRGenerator subclass

The base `LRGenerator` declares one virtual function,

    virtual float Get(int step);

To implement a subclass, e.g., `FooLRGen`, users should declare it like

    class FooLRGen : public LRGenerator {
     public:
      float Get(int step) override;
    };

Configuration of `FooLRGen` can be defined using a protocol message,

    # in user.proto
    message FooLRProto {
      ...
    }

    extend LRGenProto {
      optional FooLRProto foolr_conf = 101;
    }

The configuration is then like

    learning_rate {
      user_type : "FooLR"  # must use user_type with the same string identifier as the one used for registration
      base_lr: float
      foolr_conf {
        ...
      }
    }

Users have to register this subclass in the main function,

    driver.RegisterLRGenerator<FooLRGen>("FooLR");
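As an illustration, a hypothetical `FooLRGen` could implement a linear warmup
followed by a constant rate; the warmup schedule and the member fields below are
invented for this sketch and are not part of SINGA.

    // Hypothetical LRGenerator subclass: linear warmup to base_lr, then constant.
    // The members (base_lr_, warmup_steps_) stand in for values that a real
    // implementation would read from its configuration message.
    class FooLRGen /* : public LRGenerator */ {
     public:
      FooLRGen(float base_lr, int warmup_steps)
          : base_lr_(base_lr), warmup_steps_(warmup_steps) {}

      float Get(int step) {
        if (step < warmup_steps_)  // ramp up linearly during the warmup steps
          return base_lr_ * (step + 1) / static_cast<float>(warmup_steps_);
        return base_lr_;           // then stay constant
      }

     private:
      float base_lr_;
      int warmup_steps_;
    };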
