Modified: incubator/singa/site/trunk/content/markdown/docs/rnn.md
URL: http://svn.apache.org/viewvc/incubator/singa/site/trunk/content/markdown/docs/rnn.md?rev=1700722&r1=1700721&r2=1700722&view=diff
==============================================================================
--- incubator/singa/site/trunk/content/markdown/docs/rnn.md (original)
+++ incubator/singa/site/trunk/content/markdown/docs/rnn.md Wed Sep 2 07:59:20 2015
@@ -1,19 +1,85 @@
-## Recurrent neural networks (RNN)
---
layout: post
title: Example --- Recurrent Neural Network
category : docs
tags : [rnn, example]
---
{% include JB/setup %}

-Example files for RNN can be found in "SINGA_ROOT/examples/rnnlm", which we assume to be WORKSPACE.
-### Create DataShard

Recurrent Neural Networks (RNN) are widely used for modeling sequential data,
such as music, videos and sentences. In this example, we use SINGA to train the
[RNN model](http://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf)
proposed by Tomas Mikolov for [language modeling](https://en.wikipedia.org/wiki/Language_model).
The training objective (loss) is to minimize the
[perplexity per word](https://en.wikipedia.org/wiki/Perplexity), which is
equivalent to maximizing the probability of predicting the next word given the
current word in a sentence.

-(a) Define your own record. Please refer to [Data Preparation][1] for details.

Different from the [CNN](http://singa.incubator.apache.org/docs/cnn),
[MLP](http://singa.incubator.apache.org/docs/mlp) and
[RBM](http://singa.incubator.apache.org/docs/rbm) examples, which use built-in
[Layer](http://singa.incubator.apache.org/docs/layer)s and
[Record](http://singa.incubator.apache.org/docs/data)s, none of the layers in
this model is built-in. Hence this page also gives examples of implementing
your own Layer subclasses and data Records.

-Records for RNN example are defined in "user.proto" as an extension.

## Running instructions

Scripts for running the training job are provided in *SINGA_ROOT/examples/rnn/*.
First, prepare the data by

    $ cp Makefile.example Makefile
    $ make download
    $ make create

Second, start the training by passing the job configuration:

    # in SINGA_ROOT
    $ ./bin/singa-run.sh -conf SINGA_ROOT/examples/rnn/job.conf

## Implementations

<img src="http://singa.incubator.apache.org/assets/image/rnn-refine.png" align="center" width="300px"/>
<span><strong>Figure 1 - Net structure of the RNN model.</strong></span>

The neural net structure is shown in Figure 1.
Word records are loaded by `RnnlmDataLayer` from `WordShard`. `RnnlmWordparserLayer`
parses the word records to get word indexes (in the vocabulary). In every iteration,
`window_size` words are processed. `RnnlmWordinputLayer` looks up a word
embedding matrix to extract feature vectors for the words in the window.
These features are transformed by `RnnlmInnerproductLayer` and `RnnlmSigmoidLayer`.
`RnnlmSigmoidLayer` is a recurrent layer that forwards features from previous words
to subsequent words. Finally, `RnnlmComputationLayer` computes the perplexity loss
using the word class information from `RnnlmClassparserLayer`. A word class is a
cluster ID: words are clustered based on their frequency in the dataset, so that
words of similar frequency fall into the same cluster. Clustering improves the
efficiency of the final prediction step.

### Data preparation

We use a small dataset in this example.
In this dataset, [dataset description, e.g., format].
The subsequent steps follow the instructions in
[Data Preparation](http://singa.incubator.apache.org/docs/data) to convert the
raw data into `Record`s and insert them into `DataShard`s.

#### Download source data

    # in SINGA_ROOT/examples/rnn/
    wget http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz
    xxx

#### Define your own record

Since this dataset has a different format from the built-in `SingleLabelImageRecord`,
we need to extend the base `Record` to add new fields,

    # in SINGA_ROOT/examples/rnn/user.proto
    package singa;
-    import "common.proto"; // Record message for SINGA is defined
-    import "job.proto"; // Layer message for SINGA is defined
    import "common.proto";  // import SINGA Record
-    extend Record {
    extend Record {  // extend base Record to include users' records
      optional WordClassRecord wordclass = 101;
      optional SingleWordRecord singleword = 102;
    }
@@ -30,24 +96,94 @@ Records for RNN example are defined in "
      optional int32 class_index = 3;  // the index of the class corresponding to this word
    }

-(b) Download raw data
-This example downloads rnnlm-0.4b from [www.rnnlm.org][2] by a command

#### Create data shard for training and testing

{% comment %}
As the vocabulary size is very large, the original perplexity calculation method
is time consuming, because it has to calculate the probabilities of all possible
words for

- make download

    p(wt | w0, w1, ..., wt-1).

-The raw data is stored in a folder "rnnlm-0.4b/train" and "rnnlm-0.4b/test".
-(c) Create data shard for training and testing

Tomas proposed to divide all words into different classes according to the word
frequency, and compute the perplexity according to

-Data shards (e.g., "shard.dat") will be created in "rnnlm_class_shard", "rnnlm_vocab_shard", "rnnlm_word_shard_train" and "rnnlm_word_shard_test" by a command

    p(wt | w0, w1, ..., wt-1) = p(c | w0, w1, ..., wt-1) * p(wt | c)

where `c` is the word class and `w0, w1, ..., wt-1` are the words preceding `wt`.
The probabilities on the right side can be computed faster than

[Makefile](https://github.com/kaiping/incubator-singa/blob/rnnlm/examples/rnnlm/Makefile)
for creating the shards (see
[create_shard.cc](https://github.com/kaiping/incubator-singa/blob/rnnlm/examples/rnnlm/create_shard.cc)),
we need to specify where to download the source data, the number of classes we
want to divide all occurring words into, and all the shards together with
their names and the directories we want to create.
{% endcomment %}

*SINGA_ROOT/examples/rnn/create_shard.cc* defines the following function for creating data shards,

    void create_shard(const char *input, int nclass) {

`input` is the path to the source text file (training, test or validation data) and
`nclass` is the user-specified number of word classes. This function starts with

    using StrIntMap = std::map<std::string, int>;
    StrIntMap wordIdxMap;       // maps a word string to a word index
    StrIntMap wordClassIdxMap;  // maps a word string to a word class index
    if (-1 == nclass) {
      loadClusterForNonTrainMode(input, nclass, &wordIdxMap, &wordClassIdxMap);  // non-training phase
    } else {
      doClusterForTrainMode(input, nclass, &wordIdxMap, &wordClassIdxMap);  // training phase
    }

* If `nclass != -1`, `input` points to the training data file and `doClusterForTrainMode`
  reads all the words in the file to create the two maps. [The two maps are stored in xxx]
* Otherwise (`nclass == -1`), `input` points to either the test or the validation data
  file, and `loadClusterForNonTrainMode` loads the two maps from [xxx].

A simplified sketch of the clustering step is given below.
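The clustering inside `doClusterForTrainMode` is not shown above. The following is a
minimal, self-contained sketch of a frequency-based clustering step, assuming that
words are sorted by descending frequency and that classes are chosen to cover
roughly equal frequency mass; the function name, signature and splitting rule are
illustrative and do not reproduce the exact code in *create_shard.cc*.

    // Hypothetical sketch: cluster words into `nclass` classes by frequency.
    #include <algorithm>
    #include <fstream>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    using StrIntMap = std::map<std::string, int>;

    void DoClusterForTrainModeSketch(const char* input, int nclass,
                                     StrIntMap* wordIdxMap,
                                     StrIntMap* wordClassIdxMap) {
      // 1. Count word frequencies over the training file.
      std::map<std::string, int> freq;
      std::ifstream in(input);
      std::string word;
      while (in >> word) ++freq[word];

      // 2. Sort words by descending frequency and assign word indexes.
      std::vector<std::pair<std::string, int>> sorted(freq.begin(), freq.end());
      std::sort(sorted.begin(), sorted.end(),
                [](const auto& a, const auto& b) { return a.second > b.second; });
      for (size_t i = 0; i < sorted.size(); ++i)
        (*wordIdxMap)[sorted[i].first] = static_cast<int>(i);

      // 3. Split the sorted list into `nclass` groups so that each class
      //    covers roughly the same total frequency mass (an assumption of
      //    this sketch, not necessarily the rule used by the example).
      long long total = 0;
      for (const auto& p : sorted) total += p.second;
      long long acc = 0;
      for (const auto& p : sorted) {
        long long cls = acc * nclass / std::max<long long>(total, 1);
        (*wordClassIdxMap)[p.first] =
            static_cast<int>(std::min<long long>(nclass - 1, cls));
        acc += p.second;
      }
    }

With such a split, each class covers a manageable number of candidate words, which
is what makes the factorized prediction step in `RnnlmComputationLayer` cheaper.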
Words from the training/test/validation files are converted into `Record`s by

    singa::SingleWordRecord *wordRecord = record.MutableExtension(singa::singleword);
    while (in >> word) {
      wordRecord->set_word(word);
      wordRecord->set_word_index(wordIdxMap[word]);
      wordRecord->set_class_index(wordClassIdxMap[word]);
      snprintf(key, kMaxKeyLength, "%08d", wordIdxMap[word]);
      wordShard.Insert(std::string(key), record);
    }
    }

Compilation and running commands are provided in *Makefile.example*.
After executing `make create`, three data shards will be created by
`create_shard.cc`, namely *rnnlm_word_shard_train*, *rnnlm_word_shard_test*
and *rnnlm_word_shard_valid*.

-### Define Layers

### Layer implementation

-Similar to records, layers are also defined in "user.proto" as an extension.

7 layers (i.e., Layer subclasses) are implemented for this application,
including 1 [data layer](http://singa.incubator.apache.org/docs/layer#data-layers) which fetches data records from data
shards, 2 [parser layers](http://singa.incubator.apache.org/docs/layer#parser-layers) which parse the input records, 3 neuron layers
which transform the word features, and 1 loss layer which computes the
objective loss.

The data shards and their creation have been illustrated above. Next, we
discuss the configuration and functionality of each layer. Finally, we
introduce how to configure a job and run the training for your own model.

Following the guide for implementing [new Layer subclasses](http://singa.incubator.apache.org/docs/layer#implementing-a-new-layer-subclass),
we extend the [LayerProto](http://singa.incubator.apache.org/api/classsinga_1_1LayerProto.html)
to include the configuration message of each user-defined layer as shown below
(5 out of the 7 layers have specific configurations),

    package singa;
@@ -63,61 +199,368 @@ Similar to records, layers are also defi
      optional RnnlmDataProto rnnlmdata_conf = 207;
    }

- // 1-Message that stores parameters used by RnnlmComputationLayer
- message RnnlmComputationProto {
- optional bool bias_term = 1 [default = true]; // use bias vector or not

In the subsequent sections, we describe the implementation of each layer,
including its configuration message.

### RnnlmDataLayer

It inherits [DataLayer](/api/classsinga_1_1DataLayer.html) and loads word and
class `Record`s from `DataShard`s into memory.

#### Functionality

    void RnnlmDataLayer::Setup() {
      Read records from ClassShard to construct the mapping from word string to class index;
      Resize records_ to length window_size + 1;
      Read the 1st word record into the last position;
    }

- // 2-Message that stores parameters used by RnnlmSigmoidLayer
- message RnnlmSigmoidProto {
- optional bool bias_term = 1 [default = true]; // use bias vector or not

    void RnnlmDataLayer::ComputeFeature() {
      records_[0] = records_[windowsize_];  // copy the last record to the 1st position in the record vector
      Assign values to records_;            // read window_size new word records from WordShard
    }

- // 3-Message that stores parameters used by RnnlmInnerproductLayer
- message RnnlmInnerproductProto {
- required int32 num_output = 1; // number of outputs for the layer
- optional bool bias_term = 30 [default = true]; // use bias vector or not

The `Setup` function loads the mapping (from word string to class index) from
*ClassShard*.

Every time the `ComputeFeature` function is called, it loads `windowsize_` records
from `WordShard`. For consistency of the operations at each training iteration,
the layer maintains a record vector of length `window_size + 1`: `Setup` reads the
first word record from the *WordShard* into the last position of this vector, and
each call to `ComputeFeature` then copies the last record to the first position
before reading `window_size` new records. A standalone sketch of this sliding
window is given below.
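To make the window maintenance concrete, here is a small, self-contained sketch of
the buffer logic described above; it uses plain `int` word indexes instead of
`Record`s, and `ReadNextWordIndex()` is a stand-in for reading the next record from
the *WordShard*.

    #include <vector>

    // Stand-in for reading the next word record from the WordShard;
    // here it just returns increasing indexes.
    int ReadNextWordIndex() {
      static int next = 0;
      return next++;
    }

    class WindowBuffer {
     public:
      explicit WindowBuffer(int window_size)
          : window_size_(window_size), records_(window_size + 1, 0) {
        // Setup: read the 1st record into the last position of the vector.
        records_[window_size_] = ReadNextWordIndex();
      }

      // One call per iteration, mirroring RnnlmDataLayer::ComputeFeature().
      void NextWindow() {
        records_[0] = records_[window_size_];    // carry the last word over
        for (int i = 1; i <= window_size_; ++i)  // read window_size new words
          records_[i] = ReadNextWordIndex();
      }

      const std::vector<int>& records() const { return records_; }

     private:
      int window_size_;
      std::vector<int> records_;
    };

Because the last word of one window is carried over as the first word of the next
window, consecutive windows overlap by exactly one word, which keeps the
prediction targets contiguous across iterations.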
#### Configuration

    message RnnlmDataProto {
      required string class_path = 1;   // path to the class data file/folder, absolute or relative to the workspace
      required string word_path = 2;    // path to the word data file/folder, absolute or relative to the workspace
      required int32 window_size = 3;   // window size
    }

- // 4-Message that stores parameters used by RnnlmWordinputLayer

There are two paths: `class_path` points to the class data created above, and
`word_path` points to the word data (e.g., the training word shard for the
training phase). `window_size` is set to the number of words processed in each
iteration.

### RnnlmWordParserLayer

This layer gets `window_size` word strings from `RnnlmDataLayer` and looks each
one up in the word-string-to-word-index map to get its word index.

#### Functionality

    void RnnlmWordparserLayer::Setup(){
      Obtain window size from src layer;
      Obtain vocabulary size from src layer;
      Reshape data_ as {window_size};
    }

    void RnnlmWordparserLayer::ParseRecords(Blob* blob){
      For each word record in the window, get its word index and insert the index into blob;
    }

#### Configuration

This layer does not have specific configuration fields.

### RnnlmClassParserLayer

It maps each word in the processing window to its class index.

#### Functionality

    void RnnlmClassparserLayer::Setup(){
      Obtain window size from src layer;
      Obtain vocabulary size from src layer;
      Obtain class size from src layer;
      Reshape data_ as {windowsize_, 4};
    }

    void RnnlmClassparserLayer::ParseRecords(){
      for(int i = 1; i < records.size(); i++){
        Copy the starting word index of this word's class to data[i]'s 1st position;
        Copy the ending word index of this word's class to data[i]'s 2nd position;
        Copy the index of the input word to data[i]'s 3rd position;
        Copy the class index of the input word to data[i]'s 4th position;
      }
    }

The `Setup` function obtains the window size, vocabulary size and class size
from the source layer and reshapes the data blob to `{windowsize_, 4}`.

#### Configuration

This layer fetches the class information (the mapping between classes and
words) from RnnlmDataLayer and maintains it as data in this layer.

Next, this layer parses the last `window_size` word records from
RnnlmDataLayer and stores them as data. Then, it retrieves the corresponding
class for each input word and stores the starting word index of this class,
the ending word index of this class, the word index and the class index,
respectively. A sketch of the resulting 4-column layout is shown below.
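As an illustration of that layout, the following sketch fills one row of the
`{window_size, 4}` blob; the `ClassRange` struct and the float-typed blob are
assumptions made for this example only.

    #include <vector>

    // Hypothetical summary of the class information kept by the layer:
    // for each class, the [start, end] range of word indexes it covers.
    struct ClassRange {
      int start;  // index of the first word in this class
      int end;    // index of the last word in this class
    };

    // Fill one row (4 values) of the {window_size, 4} data blob for a single word.
    void FillRow(float* row, int word_idx, int class_idx,
                 const std::vector<ClassRange>& classes) {
      row[0] = static_cast<float>(classes[class_idx].start);  // 1st: class start word index
      row[1] = static_cast<float>(classes[class_idx].end);    // 2nd: class end word index
      row[2] = static_cast<float>(word_idx);                  // 3rd: word index
      row[3] = static_cast<float>(class_idx);                 // 4th: class index
    }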
### RnnlmWordInputLayer

Using the input word records, this layer obtains the corresponding word
(embedding) vectors as its data and passes them to the RnnlmInnerProductLayer
above for further processing.

#### Configuration

The length of each word vector needs to be configured, as does whether to use a
bias term during training (see
[job.proto](https://github.com/kaiping/incubator-singa/blob/rnnlm/src/proto/job.proto) for details).

    message RnnlmWordinputProto {
      required int32 word_length = 1;  // vector length for each input word
      optional bool bias_term = 30 [default = true];  // use bias vector or not
    }

- // 5-Message that stores parameters used by RnnlmWordparserLayer - nothing needs to be configured
- //message RnnlmWordparserProto {
- //}
-
- // 6-Message that stores parameters used by RnnlmClassparserLayer - nothing needs to be configured
- //message RnnlmClassparserProto {
- //}

#### Functionality

In the setup phase, this layer reshapes its members, such as the `data`, `grad`
and `weight` matrices, and obtains the vocabulary size from its source layer
(i.e., RnnlmWordParserLayer).

In the forward phase, the `window_size` input word indices are used to select
`window_size` word vectors from this layer's weight matrix, one row per word
index.

    void RnnlmWordinputLayer::ComputeFeature() {
      for(int t = 0; t < windowsize_; t++){
        data[t] = weight[src[t]];
      }
    }

- // 7-Message that stores parameters used by RnnlmDataLayer
- message RnnlmDataProto {
- required string class_path = 1; // path to the data file/folder, absolute or relative to the workspace
- required string word_path = 2;
- required int32 window_size = 3; // window size.

In the backward phase, after this layer's gradient has been computed by its
destination layer (i.e., RnnlmInnerProductLayer), the gradient of the weight
matrix is filled row by row (the rows corresponding to the input word indices)
from this layer's gradient.

    void RnnlmWordinputLayer::ComputeGradient() {
      for(int t = 0; t < windowsize_; t++){
        gweight[src[t]] = grad[t];
      }
    }

-### Configure Job

### RnnlmInnerProductLayer

This is a neuron layer which receives the data from RnnlmWordInputLayer and
sends the computation results to RnnlmSigmoidLayer.

#### Configuration

The number of outputs (neurons) needs to be specified, as well as whether to
use a bias term.

    message RnnlmInnerproductProto {
      required int32 num_output = 1;  // number of outputs for the layer
      optional bool bias_term = 30 [default = true];  // use bias vector or not
    }

#### Functionality

In the forward phase, this layer computes the dot product between the data of
its source layer (i.e., RnnlmWordInputLayer) and its weight matrix.

    void RnnlmInnerproductLayer::ComputeFeature() {
      data = dot(src, weight);  // dot product operation
    }

In the backward phase, this layer first computes the gradient of its source
layer (i.e., RnnlmWordInputLayer), and then computes the gradient of its own
weight matrix by aggregating the results of all timestamps:

    void RnnlmInnerproductLayer::ComputeGradient() {
      for (int t = 0; t < windowsize_; t++) {
        Add the dot product of src[t] and grad[t] to gweight;
      }
      Copy the dot product of grad and weight to gsrc;
    }

### RnnlmSigmoidLayer

This is a recurrent neuron layer. For each timestamp, the corresponding
component of its data uses the previous timestamp's data component as part of
its input. This is how the time-order information is exploited in this language
model.

If you want to implement a recurrent neural network following our design, this
is the layer to refer to; other designs for making use of information from past
timestamps are of course possible. The recurrence this layer implements is
sketched below.
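Below is a minimal, self-contained sketch of that recurrence, ignoring the bias
term and using plain vectors instead of SINGA blobs; the function and variable
names are illustrative and do not mirror the layer's actual members.

    #include <cmath>
    #include <vector>

    using Vec = std::vector<float>;
    using Mat = std::vector<Vec>;  // row-major: W[i][j]

    float Sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

    // data[t] = sigmoid(src[t] + data[t-1] * W), for t = 0 .. window_size-1.
    // src holds the outputs of RnnlmInnerProductLayer for each position.
    void Recurrence(const std::vector<Vec>& src, const Mat& W,
                    std::vector<Vec>* data) {
      int window_size = static_cast<int>(src.size());
      int dim = static_cast<int>(src[0].size());
      data->assign(window_size, Vec(dim, 0.0f));
      for (int t = 0; t < window_size; ++t) {
        for (int j = 0; j < dim; ++j) {
          float pre = src[t][j];
          if (t > 0) {  // add the contribution from the previous timestamp
            for (int i = 0; i < dim; ++i) pre += (*data)[t - 1][i] * W[i][j];
          }
          (*data)[t][j] = Sigmoid(pre);
        }
      }
    }

Note how position `t` only ever reads position `t - 1`; this dependency is what
forces the timestamps to be processed in order.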
#### Configuration

Whether to use a bias term needs to be specified.

    message RnnlmSigmoidProto {
      optional bool bias_term = 1 [default = true];  // use bias vector or not
    }

#### Functionality

In the forward phase, this layer first receives data from its source layer
(i.e., RnnlmInnerProductLayer), which forms one part of the input. Then, for
each timestamp, it computes the dot product between the previous timestamp's
data and its own weight matrix, which forms the other part. The two parts are
summed and passed through the sigmoid activation.

    void RnnlmSigmoidLayer::ComputeFeature() {
      for(int t = 0; t < window_size; t++){
        if(t == 0) Copy the sigmoid results of src[t] to data[t];
        else Compute the dot product of data[t - 1] and weight, add src[t], and set data[t] to the sigmoid of the sum;
      }
    }

In the backward phase, RnnlmSigmoidLayer first updates its member `grad` using
information from the next timestamp. Then, iterating over the timestamps, it
computes the gradients of its weight matrix and of its source layer
RnnlmInnerProductLayer.

    void RnnlmSigmoidLayer::ComputeGradient(){
      Update grad[t];  // update the gradient of the current layer, adding a term from the next timestamp
      for (int t = 0; t < windowsize_; t++) {
        Update gweight;   // compute the gradient of the weight matrix
        Compute gsrc[t];  // compute the gradient of the src layer
      }
    }

### RnnlmComputationLayer

This layer is the loss layer, in which the performance metrics, i.e., the
probability of predicting the next word correctly and the perplexity (PPL for
short), are computed. The layer is composed of a class-information part and a
word-information part, so the computation can essentially be divided into two
parts by slicing this layer's weight matrix.

#### Configuration

Whether to use a bias term during training needs to be specified.

    message RnnlmComputationProto {
      optional bool bias_term = 1 [default = true];  // use bias vector or not
    }

#### Functionality

In the forward phase, using the two sliced weight matrices (one for the class
information, the other for the words in the predicted class), this
RnnlmComputationLayer calculates the dot product between the source layer's
data and each sliced matrix. The results can be denoted as `y1` and `y2`.
After a softmax function, the probability distribution over classes and over
the words in the class are obtained for each input word; the activated results
can be denoted as `p1` and `p2`. Finally, the PPL value is computed from these
probability distributions (a standalone sketch of this factorized computation
follows the pseudo code below).

    void RnnlmComputationLayer::ComputeFeature() {
      Compute y1 and y2;
      p1 = Softmax(y1);
      p2 = Softmax(y2);
      Compute the perplexity value PPL;
    }
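The following self-contained sketch shows how the factorized probability
`p(wt) = p(class | h) * p(wt | class, h)` and the perplexity could be computed
from the two softmax outputs `p1` and `p2`; the data layout and helper names are
assumptions for illustration, not the layer's real members.

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Softmax over a plain vector of scores.
    std::vector<double> Softmax(const std::vector<double>& y) {
      double max = y[0];
      for (double v : y) max = std::max(max, v);
      double sum = 0.0;
      std::vector<double> p(y.size());
      for (size_t i = 0; i < y.size(); ++i) sum += (p[i] = std::exp(y[i] - max));
      for (double& v : p) v /= sum;
      return p;
    }

    // Probability of the target word under the class factorization.
    // y1 holds the class scores; y2 holds the scores of the words inside the
    // target class, and word_offset is the target word's position in that class.
    double WordProbability(const std::vector<double>& y1, int class_idx,
                           const std::vector<double>& y2, int word_offset) {
      std::vector<double> p1 = Softmax(y1);  // distribution over classes
      std::vector<double> p2 = Softmax(y2);  // distribution over words in class
      return p1[class_idx] * p2[word_offset];
    }

    // Perplexity over a window: exp of the average negative log-probability.
    double Perplexity(const std::vector<double>& word_probs) {
      double sum_log = 0.0;
      for (double p : word_probs) sum_log += std::log(p);
      return std::exp(-sum_log / word_probs.size());
    }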
In the backward phase, this layer performs three computations. First, it
computes its own gradient for each timestamp. Second, it computes the gradient
of its weight matrix by aggregating the results of all timestamps. Third, it
computes the gradient of its source layer, RnnlmSigmoidLayer, timestamp by
timestamp.

    void RnnlmComputationLayer::ComputeGradient(){
      Compute grad[t] for all timestamps;
      Compute gweight by aggregating the results computed for different timestamps;
      Compute gsrc[t] for all timestamps;
    }

## Updater Configuration

We employ the kFixedStep method for changing the learning rate, using different
learning rate values in different step ranges. See the
[Updater](http://singa.incubator.apache.org/docs/updater) page for more
information about choosing updaters.

    updater {
      # weight_decay: 0.0000001
      lr_change: kFixedStep
      type: kSGD
      fixedstep_conf {
        step: 0
        step: 42810
        step: 49945
        step: 57080
        step: 64215
        step_lr: 0.1
        step_lr: 0.05
        step_lr: 0.025
        step_lr: 0.0125
        step_lr: 0.00625
      }
    }

## TrainOneBatch() Function

We use the BP (back-propagation) algorithm to train the RNN model. The
corresponding configuration is

    # in job.conf file
    alg: kBackPropagation

Refer to [TrainOneBatch](http://singa.incubator.apache.org/docs/train-one-batch)
for more information on the different TrainOneBatch() functions.

## Cluster Configuration

In this RNN language model, we configure the cluster topology as follows.

    cluster {
      nworker_groups: 1
      nserver_groups: 1
      nservers_per_group: 1
      nworkers_per_group: 1
      nservers_per_procs: 1
      nworkers_per_procs: 1
      workspace: "examples/rnnlm/"
    }

This trains the model on a single node. For other configuration choices, please
refer to the [frameworks](http://singa.incubator.apache.org/docs/frameworks) page.

## Configure Job

Job configuration is written in "job.conf".

Note: extended field names should be enclosed in square brackets, e.g.,
`[singa.rnnlmdata_conf]`.

-### Run Training

## Run Training

Start training by the following commands:

    cd SINGA_ROOT
    ./bin/singa-run.sh -workspace=examples/rnnlm

-
-
-
-
- [1]: http://singa.incubator.apache.org/docs/data.html
- [2]: www.rnnlm.org
\ No newline at end of file
Added: incubator/singa/site/trunk/content/markdown/docs/train-one-batch.md
URL: http://svn.apache.org/viewvc/incubator/singa/site/trunk/content/markdown/docs/train-one-batch.md?rev=1700722&view=auto
==============================================================================
--- incubator/singa/site/trunk/content/markdown/docs/train-one-batch.md (added)
+++ incubator/singa/site/trunk/content/markdown/docs/train-one-batch.md Wed Sep 2 07:59:20 2015
@@ -0,0 +1,183 @@
---
layout: post
title: TrainOneBatch
category : docs
tags : [CD, BP]
---
{% include JB/setup %}

For each SGD iteration, every worker calls the `TrainOneBatch` function to
compute gradients of the parameters associated with its local layers (i.e., the
layers dispatched to it). SINGA implements two algorithms for the
`TrainOneBatch` function; users select the appropriate algorithm for their
model in the configuration.

## Basic user guide

### Back-propagation

The [BP algorithm](http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf) is used for
computing gradients of feed-forward models, e.g., [CNN](http://singa.incubator.apache.org/docs/cnn)
and [MLP](http://singa.incubator.apache.org/docs/mlp), and of the [RNN](http://singa.incubator.apache.org/docs/rnn) models in SINGA.

    # in job.conf
    alg: kBP

To use the BP algorithm for the `TrainOneBatch` function, users simply
configure the `alg` field with `kBP`. If a neural net contains user-defined
layers, these layers must be implemented properly to be consistent with the
implementation of the BP algorithm in SINGA (see below).

### Contrastive Divergence

The [CD algorithm](http://www.cs.toronto.edu/~fritz/absps/nccd.pdf) is used for
computing gradients of energy models like RBM.

    # job.conf
    alg: kCD
    cd_conf {
      cd_k: 2
    }

To use the CD algorithm for the `TrainOneBatch` function, users configure the
`alg` field to `kCD`. Users can also configure the number of Gibbs sampling
steps in the CD algorithm through the `cd_k` field. By default, it is set to 1.

## Advanced user guide

### Implementation of BP

The BP algorithm is implemented in SINGA following the pseudo code below,

    BPTrainOnebatch(step, net) {
      // forward propagate
      foreach layer in net.local_layers() {
        if IsBridgeDstLayer(layer)
          recv data from the src layer (i.e., BridgeSrcLayer)
        foreach param in layer.params()
          Collect(param)  // recv response from servers for last update

        layer.ComputeFeature(kForward)

        if IsBridgeSrcLayer(layer)
          send layer.data_ to dst layer
      }
      // backward propagate
      foreach layer in reverse(net.local_layers) {
        if IsBridgeSrcLayer(layer)
          recv gradient from the dst layer (i.e., BridgeDstLayer)
          recv response from servers for last update

        layer.ComputeGradient()
        foreach param in layer.params()
          Update(step, param)  // send param.grad_ to servers

        if IsBridgeDstLayer(layer)
          send layer.grad_ to src layer
      }
    }

It forwards features through all local layers (locality can be checked via the
layer partition ID and worker ID) and propagates gradients backward in the
reverse order.
[BridgeSrcLayer](http://singa.incubator.apache.org/docs/layer/#bridgesrclayer--bridgedstlayer)
(resp. `BridgeDstLayer`) is blocked until the feature (resp. gradient) from the
source (resp. destination) layer arrives. Parameter gradients are sent to
servers via the `Update` function. Updated parameters are collected via the
`Collect` function, which is blocked until the parameter has been updated.
[Param](http://singa.incubator.apache.org/docs/param) objects have versions,
which can be used to check whether a `Param` object has been updated or not.

Since RNN models are unrolled into feed-forward models, users need to implement
the forward propagation in the recurrent layer's `ComputeFeature` function and
the backward propagation in the recurrent layer's `ComputeGradient` function.
As a result, the whole `TrainOneBatch` runs the
[back-propagation through time (BPTT)](https://en.wikipedia.org/wiki/Backpropagation_through_time)
algorithm. A sketch of such a recurrent layer is given below.
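To illustrate what implementing BPTT inside a layer means, here is a toy
recurrent layer whose `ComputeFeature` unrolls the recurrence over a window and
whose `ComputeGradient` propagates the error back through the same window; it
uses a scalar recurrent weight, does not derive from SINGA's real `Layer` class,
and is only a sketch of the idea.

    #include <cmath>
    #include <vector>

    using Vec = std::vector<float>;

    // Minimal recurrent "layer": h[t] = tanh(x[t] + w * h[t-1]) with a scalar
    // recurrent weight w, so the BPTT arithmetic stays easy to follow.
    struct ToyRecurrentLayer {
      float w = 0.5f;   // recurrent weight (a scalar for simplicity)
      Vec x, h;         // inputs and hidden states for one window
      float gw = 0.0f;  // accumulated gradient of w

      // Forward pass over the whole window (the "unrolled" part of BPTT).
      void ComputeFeature(const Vec& input) {
        x = input;
        h.assign(x.size(), 0.0f);
        for (size_t t = 0; t < x.size(); ++t) {
          float prev = (t == 0) ? 0.0f : h[t - 1];
          h[t] = std::tanh(x[t] + w * prev);
        }
      }

      // Backward pass: given dL/dh[t], accumulate dL/dw and return dL/dx[t].
      Vec ComputeGradient(const Vec& grad_h) {
        Vec grad_x(x.size(), 0.0f);
        float carry = 0.0f;  // gradient flowing back through the recurrent link
        gw = 0.0f;
        for (int t = static_cast<int>(x.size()) - 1; t >= 0; --t) {
          float total = grad_h[t] + carry;             // local + future error
          float dpre = total * (1.0f - h[t] * h[t]);   // tanh'(pre) = 1 - h^2
          grad_x[t] = dpre;                            // gradient w.r.t. x[t]
          float prev = (t == 0) ? 0.0f : h[t - 1];
          gw += dpre * prev;                           // gradient w.r.t. w
          carry = dpre * w;                            // pass error to h[t-1]
        }
        return grad_x;
      }
    };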
### Implementation of CD

The CD algorithm is implemented in SINGA following the pseudo code below,

    CDTrainOneBatch(step, net) {
      # positive phase
      foreach layer in net.local_layers()
        if IsBridgeDstLayer(layer)
          recv positive phase data from the src layer (i.e., BridgeSrcLayer)
        foreach param in layer.params()
          Collect(param)  // recv response from servers for last update
        layer.ComputeFeature(kPositive)
        if IsBridgeSrcLayer(layer)
          send positive phase data to dst layer

      # negative phase
      foreach gibbs in [0...layer_proto_.cd_k]
        foreach layer in net.local_layers()
          if IsBridgeDstLayer(layer)
            recv negative phase data from the src layer (i.e., BridgeSrcLayer)
          layer.ComputeFeature(kNegative)
          if IsBridgeSrcLayer(layer)
            send negative phase data to dst layer

      foreach layer in net.local_layers()
        layer.ComputeGradient()
        foreach param in layer.params
          Update(param)
    }

Parameter gradients are computed after the positive phase and negative phase.

### Implementing a new algorithm

SINGA implements BP and CD by creating two subclasses of the
[Worker](api/classsinga_1_1Worker.html) class:
[BPWorker](api/classsinga_1_1BPWorker.html)'s `TrainOneBatch` function implements
the BP algorithm, and [CDWorker](api/classsinga_1_1CDWorker.html)'s
`TrainOneBatch` function implements the CD algorithm. To implement a new
algorithm for the `TrainOneBatch` function, users need to create a new subclass
of `Worker`, e.g.,

    class FooWorker : public Worker {
      void TrainOneBatch(int step, shared_ptr<NeuralNet> net, Metric* perf) override;
      void TestOneBatch(int step, Phase phase, shared_ptr<NeuralNet> net, Metric* perf) override;
    };

`FooWorker` must implement the above two functions for training one mini-batch
and testing one mini-batch, respectively. The `perf` argument is for collecting
training or testing performance, e.g., the objective loss or accuracy. It is
passed to the `ComputeFeature` function of each layer.

Users can define configuration fields for `FooWorker`, e.g.,

    # in user.proto
    message FooWorkerProto {
      optional int32 b = 1;
    }

    extend JobProto {
      optional FooWorkerProto foo_conf = 101;
    }

    # in job.proto
    JobProto {
      ...
      extensions 101 to max;
    }

This is similar to [adding configuration fields for a new layer](http://singa.incubator.apache.org/docs/layer/#implementing-a-new-layer-subclass).

To use `FooWorker`, users need to register it in the [main.cc](http://singa.incubator.apache.org/docs/programming-guide)
and configure the `alg` and `foo_conf` fields,

    # in main.cc
    const int kFoo = 3;  // worker ID, must be different from that of CDWorker and BPWorker
    driver.RegisterWorker<FooWorker>(kFoo);

    # in job.conf
    ...
    alg: 3
    [foo_conf] {
      b: 4
    }

Added: incubator/singa/site/trunk/content/markdown/docs/updater.md
URL: http://svn.apache.org/viewvc/incubator/singa/site/trunk/content/markdown/docs/updater.md?rev=1700722&view=auto
==============================================================================
--- incubator/singa/site/trunk/content/markdown/docs/updater.md (added)
+++ incubator/singa/site/trunk/content/markdown/docs/updater.md Wed Sep 2 07:59:20 2015
@@ -0,0 +1,288 @@
---
layout: post
title: Updater
category : docs
tags : [updater]
---
{% include JB/setup %}

Every server in SINGA has an [Updater](api/classsinga_1_1Updater.html)
instance that updates parameters based on gradients.
In this page, the *Basic user guide* describes the configuration of an updater.
The *Advanced user guide* presents details on how to implement a new updater
and a new learning rate changing method.

## Basic user guide

There are many different parameter updating protocols (i.e., subclasses of
`Updater`). They share some configuration fields:

* `type`, an integer for identifying an updater;
* `learning_rate`, configuration for the
[LRGenerator](http://singa.incubator.apache.org/api/classsinga_1_1LRGenerator.html) which controls the learning rate;
* `weight_decay`, the coefficient for [L2 regularization](http://deeplearning.net/tutorial/gettingstarted.html#regularization);
* [momentum](http://ufldl.stanford.edu/tutorial/supervised/OptimizationStochasticGradientDescent/).

If you are not familiar with the above terms, you can find their meanings in
[this page provided by Karpathy](http://cs231n.github.io/neural-networks-3/#update).

### Configuration of built-in updater classes

#### Updater

The base `Updater` implements the [vanilla SGD algorithm](http://cs231n.github.io/neural-networks-3/#sgd).
Its configuration type is `kSGD`.
Users need to configure at least the `learning_rate` field;
`momentum` and `weight_decay` are optional.

    updater {
      type: kSGD
      momentum: float
      weight_decay: float
      learning_rate {

      }
    }

#### AdaGradUpdater

It inherits the base `Updater` to implement the
[AdaGrad](http://www.magicbroom.info/Papers/DuchiHaSi10.pdf) algorithm.
Its type is `kAdaGrad`.
`AdaGradUpdater` is configured similarly to `Updater` except
that `momentum` is not used.

#### NesterovUpdater

It inherits the base `Updater` to implement the
[Nesterov](http://arxiv.org/pdf/1212.0901v2.pdf) (section 3.5) updating protocol.
Its type is `kNesterov`.
`learning_rate` and `momentum` must be configured; `weight_decay` is optional.

#### RMSPropUpdater

It inherits the base `Updater` to implement the
[RMSProp algorithm](http://cs231n.github.io/neural-networks-3/#sgd) proposed by
[Hinton](http://www.cs.toronto.edu/%7Etijmen/csc321/slides/lecture_slides_lec6.pdf) (slide 29).
Its type is `kRMSProp`.

    updater {
      type: kRMSProp
      rmsprop_conf {
        rho: float  # [0,1]
      }
    }

### Configuration of learning rate

The `learning_rate` field is configured as

    learning_rate {
      type: ChangeMethod
      base_lr: float  # base/initial learning rate
      ...             # fields of a specific changing method
    }

The common fields include `type` and `base_lr`. SINGA provides the following
`ChangeMethod`s.

#### kFixed

The `base_lr` is used for all steps.
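Before turning to the other change methods, the following conceptual sketch shows
how `learning_rate`, `momentum` and `weight_decay` typically combine in one vanilla
SGD step; it illustrates the textbook update rule, not SINGA's actual
`Updater::Update` implementation.

    #include <vector>

    // One vanilla SGD step with momentum and L2 weight decay (textbook form):
    //   v = momentum * v - lr * (grad + weight_decay * w)
    //   w = w + v
    void SgdStep(std::vector<float>* weights, const std::vector<float>& grads,
                 std::vector<float>* velocity, float lr, float momentum,
                 float weight_decay) {
      for (size_t i = 0; i < weights->size(); ++i) {
        float g = grads[i] + weight_decay * (*weights)[i];  // add L2 term
        (*velocity)[i] = momentum * (*velocity)[i] - lr * g;
        (*weights)[i] += (*velocity)[i];
      }
    }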
#### kLinear

The updater should be configured like

    learning_rate {
      base_lr: float
      linear_conf {
        freq: int
        final_lr: float
      }
    }

Linear interpolation is used to change the learning rate,

    lr = (1 - step / freq) * base_lr + (step / freq) * final_lr

#### kExponential

The updater should be configured like

    learning_rate {
      base_lr: float
      exponential_conf {
        freq: int
      }
    }

The learning rate for `step` is

    lr = base_lr / 2^(step / freq)

#### kInverseT

The updater should be configured like

    learning_rate {
      base_lr: float
      inverset_conf {
        final_lr: float
      }
    }

The learning rate for `step` is

    lr = base_lr / (1 + step / final_lr)

#### kInverse

The updater should be configured like

    learning_rate {
      base_lr: float
      inverse_conf {
        gamma: float
        pow: float
      }
    }

The learning rate for `step` is

    lr = base_lr * (1 + gamma * step)^(-pow)

#### kStep

The updater should be configured like

    learning_rate {
      base_lr : float
      step_conf {
        change_freq: int
        gamma: float
      }
    }

The learning rate for `step` is

    lr = base_lr * gamma^(step / change_freq)

#### kFixedStep

The updater should be configured like

    learning_rate {
      fixedstep_conf {
        step: int
        step_lr: float

        step: int
        step_lr: float

        ...
      }
    }

Denote the i-th tuple as (step[i], step_lr[i]). The learning rate for `step` is

    step_lr[k]

where step[k] is the largest configured step that is not larger than `step`.

## Advanced user guide

### Implementing a new Updater subclass

The base Updater class has one virtual function,

    class Updater {
     public:
      virtual void Update(int step, Param* param, float grad_scale = 1.0f) = 0;

     protected:
      UpdaterProto proto_;
      LRGenerator lr_gen_;
    };

It updates the values of `param` based on its gradients. The `step` argument is
for deciding the learning rate, which may change over time (steps).
`grad_scale` scales the original gradient values. This function is called by a
server once it has received all gradients of the same `Param` object.

To implement a new Updater subclass, users must override the `Update` function.

    class FooUpdater : public Updater {
      void Update(int step, Param* param, float grad_scale = 1.0f) override;
    };

Configuration of this new updater can be declared similarly to that of a new
layer,

    # in user.proto
    message FooUpdaterProto {
      optional int32 c = 1;
    }

    extend UpdaterProto {
      optional FooUpdaterProto fooupdater_conf = 101;
    }

The new updater should be registered in the
[main function](http://singa.incubator.apache.org/docs/programming-guide),

    driver.RegisterUpdater<FooUpdater>("FooUpdater");

Users can then configure the job as

    # in job.conf
    updater {
      user_type: "FooUpdater"  # must use user_type with the same string identifier as the one used for registration
      fooupdater_conf {
        c : 20;
      }
    }

### Implementing a new LRGenerator subclass

The base `LRGenerator` declares one virtual function,

    virtual float Get(int step);

To implement a subclass, e.g., `FooLRGen`, users should declare it like

    class FooLRGen : public LRGenerator {
     public:
      float Get(int step) override;
    };

Configuration of `FooLRGen` can be defined using a protocol message,

    # in user.proto
    message FooLRProto {
      ...
    }

    extend LRGenProto {
      optional FooLRProto foolr_conf = 101;
    }

The configuration is then like

    learning_rate {
      user_type : "FooLR"  # must use user_type with the same string identifier as the one used for registration
      base_lr: float
      foolr_conf {
        ...
      }
    }

Users have to register this subclass in the main function,

    driver.RegisterLRGenerator<FooLRGen>("FooLR");
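As an illustration, a hypothetical `FooLRGen` could implement a linear warmup
followed by a constant rate; the warmup schedule and the member fields below are
invented for this sketch and are not part of SINGA.

    // Hypothetical LRGenerator subclass: linear warmup to base_lr, then constant.
    // The members (base_lr_, warmup_steps_) stand in for values that a real
    // implementation would read from its configuration message.
    class FooLRGen /* : public LRGenerator */ {
     public:
      FooLRGen(float base_lr, int warmup_steps)
          : base_lr_(base_lr), warmup_steps_(warmup_steps) {}

      float Get(int step) {
        if (step < warmup_steps_)  // ramp up linearly during the warmup steps
          return base_lr_ * (step + 1) / static_cast<float>(warmup_steps_);
        return base_lr_;           // then stay constant
      }

     private:
      float base_lr_;
      int warmup_steps_;
    };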
