Modified: incubator/singa/site/trunk/content/markdown/docs/rnn.md URL: http://svn.apache.org/viewvc/incubator/singa/site/trunk/content/markdown/docs/rnn.md?rev=1703880&r1=1703879&r2=1703880&view=diff ============================================================================== --- incubator/singa/site/trunk/content/markdown/docs/rnn.md (original) +++ incubator/singa/site/trunk/content/markdown/docs/rnn.md Fri Sep 18 15:10:58 2015 @@ -1,157 +1,148 @@ -# RNN Example +Recurrent Neural Networks for Language Modelling +--- -Recurrent Neural Networks (RNN) are widely used for modeling sequential data, -such as music, videos and sentences. In this example, we use SINGA to train a +Recurrent Neural Networks (RNN) are widely used for modelling sequential data, +such as music and sentences. In this example, we use SINGA to train a [RNN model](http://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf) proposed by Tomas Mikolov for [language modeling](https://en.wikipedia.org/wiki/Language_model). The training objective (loss) is -minimize the [perplexity per word](https://en.wikipedia.org/wiki/Perplexity), which +to minimize the [perplexity per word](https://en.wikipedia.org/wiki/Perplexity), which is equivalent to maximize the probability of predicting the next word given the current word in a sentence. -Different to the [CNN](http://singa.incubator.apache.org/docs/cnn), [MLP](http://singa.incubator.apache.org/docs/mlp) -and [RBM](http://singa.incubator.apache.org/docs/rbm) examples which use built-in -[Layer](http://singa.incubator.apache.org/docs/layer)s and [Record](http://singa.incubator.apache.org/docs/data)s, -none of the layers in this model is built-in. Hence users can get examples of -implementing their own Layers and data Records in this page. +Different to the [CNN](cnn.html), [MLP](mlp.html) +and [RBM](rbm.html) examples which use built-in +layers(layer) and records(data), +none of the layers in this example are built-in. Hence users would learn to +implement their own layers and data records through this example. ## Running instructions -In *SINGA_ROOT/examples/rnn/*, scripts are provided to run the training job. +In *SINGA_ROOT/examples/rnnlm/*, scripts are provided to run the training job. First, the data is prepared by $ cp Makefile.example Makefile $ make download $ make create -Second, the training is started by passing the job configuration as, +Second, to compile the source code under *examples/rnnlm/*, run - # in SINGA_ROOT - $ ./bin/singa-run.sh -conf SINGA_ROOT/examples/rnn/job.conf + $ make rnnlm +An executable file *rnnlm.bin* will be generated. +Third, the training is started by passing *rnnlm.bin* and the job configuration +to *singa-run.sh*, + + # at SINGA_ROOT/ + # export LD_LIBRARY_PATH=.libs:$LD_LIBRARY_PATH + $ ./bin/singa-run.sh -exec examples/rnnlm/rnnlm.bin -conf examples/rnnlm/job.conf ## Implementations -<img src="http://singa.incubator.apache.org/images/rnn-refine.png" align="center" width="300px"/> +<img src="../images/rnnlm.png" align="center" width="400px"/> <span><strong>Figure 1 - Net structure of the RNN model.</strong></span> -The neural net structure is shown Figure 1. -Word records are loaded by `RnnlmDataLayer` from `WordShard`. `RnnlmWordparserLayer` -parses word records to get word indexes (in the vocabulary). For every iteration, -`window_size` words are processed. `RnnlmWordinputLayer` looks up a word -embedding matrix to extract feature vectors for words in the window. 
-These features are transformed by `RnnlmInnerproductLayer` layer and `RnnlmSigmoidLayer`. -`RnnlmSigmoidLayer` is a recurrent layer that forwards features from previous words -to next words. Finally, `RnnlmComputationLayer` computes the perplexity loss with -word class information from `RnnlmClassparserLayer`. The word class is a cluster ID. -Words are clustered based on their frequency in the dataset, e.g., frequent words -are clustered together and less frequent words are clustered together. Clustering -is to improve the efficiency of the final prediction process. +The neural net structure is shown Figure 1. Word records are loaded by +`DataLayer`. For every iteration, at most `max_window` word records are +processed. If a sentence ending character is read, the `DataLayer` stops +loading immediately. `EmbeddingLayer` looks up a word embedding matrix to extract +feature vectors for words loaded by the `DataLayer`. These features are transformed by the +`HiddenLayer` which propagates the features from left to right. The +output feature for word at position k is influenced by words from position 0 to +k-1. Finally, `LossLayer` computes the cross-entropy loss (see below) +by predicting the next word of each word. +`LabelLayer` reads the same number of word records as the embedding layer but starts from +position 1. Consequently, the word record at position k in `LabelLayer` is the ground +truth for the word at position k in `LossLayer`. + +The cross-entropy loss is computed as + +`$$L(w_t)=-log P(w_{t+1}|w_t)$$` + +Given `$w_t$` the above equation would compute over all words in the vocabulary, +which is time consuming. +[RNNLM Toolkit](https://f25ea9ccb7d3346ce6891573d543960492b92c30.googledrive.com/host/0ByxdPXuxLPS5RFM5dVNvWVhTd0U/rnnlm-0.4b.tgz) +accelerates the computation as + +`$$P(w_{t+1}|w_t) = P(C_{w_{t+1}}|w_t) * P(w_{t+1}|C_{w_{t+1}})$$` + +Words from the vocabulary are partitioned into a user-defined number of classes. +The first term on the left side predicts the class of the next word, and +then predicts the next word given its class. Both the number of classes and +the words from one class are much smaller than the vocabulary size. The probabilities +can be calculated much faster. + +The perplexity per word is computed by, + +`$$PPL = 10^{- avg_t log_{10} P(w_{t+1}|w_t)}$$` ### Data preparation -We use a small dataset in this example. In this dataset, [dataset description, e.g., format]. +We use a small dataset provided by the [RNNLM Toolkit](https://f25ea9ccb7d3346ce6891573d543960492b92c30.googledrive.com/host/0ByxdPXuxLPS5RFM5dVNvWVhTd0U/rnnlm-0.4b.tgz). +It has 10,000 training sentences, with 71350 words in total and 3720 unique words. The subsequent steps follow the instructions in -[Data Preparation](http://singa.incubator.apache.org/docs/data) to convert the -raw data into `Record`s and insert them into `DataShard`s. +[Data Preparation](data.html) to convert the +raw data into records and insert them into `DataShard`s. #### Download source data - # in SINGA_ROOT/examples/rnn/ - wget http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz - xxx - + # in SINGA_ROOT/examples/rnnlm/ + cp Makefile.example Makefile + make download -#### Define your own record. 
+#### Define your own record -Since this dataset has different format as the built-in `SingleLabelImageRecord`, -we need to extend the base `Record` to add new fields, +We define the word record as follows, - # in SINGA_ROOT/examples/rnn/user.proto - package singa; - - import "common.proto"; // import SINGA Record - - extend Record { // extend base Record to include users' records - optional WordClassRecord wordclass = 101; - optional SingleWordRecord singleword = 102; + # in SINGA_ROOT/examples/rnnlm/rnnlm.proto + message WordRecord { + optional string word = 1; + optional int32 word_index = 2; + optional int32 class_index = 3; + optional int32 class_start = 4; + optional int32 class_end = 5; } - message WordClassRecord { - optional int32 class_index = 1; // the index of this class - optional int32 start = 2; // the index of the start word in this class; - optional int32 end = 3; // the index of the end word in this class + extend singa.Record { + optional WordRecord word = 101; } - message SingleWordRecord { - optional string word = 1; - optional int32 word_index = 2; // the index of this word in the vocabulary - optional int32 class_index = 3; // the index of the class corresponding to this word - } - - -#### Create data shard for training and testing - -{% comment %} -As the vocabulary size is very large, the original perplexity calculation method -is time consuming. Because it has to calculate the probabilities of all possible -words for - - p(wt|w0, w1, ... wt-1). - - -Tomas proposed to divide all -words into different classes according to the word frequency, and compute the -perplexity according to +It includes the word string and its index in the vocabulary. +Words in the vocabulary are sorted based on their frequency in the training dataset. +The sorted list is cut into 100 sublists such that each sublist has 1/100 total +word frequency. Each sublist is called a class. +Hence each word has a `class_index` ([0,100)). The `class_start` is the index +of the first word in the same class as `word`. The `class_end` is the index of +the first word in the next class. - p(wt|w0, w1, ... wt-1) = p(c|w0,w1,..wt-1) p(w|c) +#### Create DataShards -where `c` is the word class, `w0, w1...wt-1` are the previous words before `wt`. -The probabilities on the right side can be computed faster than +We use code from RNNLM Toolkit to read words, and sort them into classes. +The main function in *create_shard.cc* first creates word classes based on the training +dataset. Second it calls the following function to create data shards for the +training, validation and test dataset. + int create_shard(const char *input_file, const char *output_file); -[Makefile](https://github.com/kaiping/incubator-singa/blob/rnnlm/examples/rnnlm/Makefile) -for creating the shards (see in - [create_shard.cc](https://github.com/kaiping/incubator-singa/blob/rnnlm/examples/rnnlm/create_shard.cc)), - we need to specify where to download the source data, number of classes we - want to divide all occurring words into, and all the shards together with - their names, directories we want to create. -{% endcomment %} - -*SINGA_ROOT/examples/rnn/create_shard.cc* defines the following function for creating data shards, - - void create_shard(const char *input, int nclass) { - -`input` is the path to [the text file], `nclass` is user specified cluster size. +`input` is the path to training/validation/testing text file from the RNNLM Toolkit, `output` is output shard folder. 
This function starts with - using StrIntMap = std::map<std::string, int>; - StrIntMap *wordIdxMapPtr; // Mapping word string to a word index - StrIntMap *wordClassIdxMapPtr; // Mapping word string to a word class index - if (-1 == nclass) { - loadClusterForNonTrainMode(input, nclass, &wordIdxMap, &wordClassIdxMap); // non-training phase - } else { - doClusterForTrainMode(input, nclass, &wordIdxMap, &wordClassIdxMap); // training phase - } + DataShard dataShard(output, DataShard::kCreate); +Then it reads the words one by one. For each word it creates a `WordRecord` instance, +and inserts it into the `dataShard`. - * If `-1 == nclass`, `path` points to the training data file. `doClusterForTrainMode` - reads all the words in the file to create the two maps. [The two maps are stored in xxx] - * otherwise, `path` points to either test or validation data file. `loadClusterForNonTrainMode` - loads the two maps from [xxx]. - -Words from training/text/validation files are converted into `Record`s by - - singa::SingleWordRecord *wordRecord = record.MutableExtension(singa::singleword); - while (in >> word) { - wordRecord->set_word(word); - wordRecord->set_word_index(wordIdxMap[word]); - wordRecord->set_class_index(wordClassIdxMap[word]); - snprintf(key, kMaxKeyLength, "%08d", wordIdxMap[word]); - wordShard.Insert(std::string(key), record); - } + int wcnt = 0; // word count + singa.Record record; + WordRecord* wordRecord = record.MutableExtension(word); + while(1) { + readWord(wordstr, fin); + if (feof(fin)) break; + ...// fill in the wordRecord; + int length = snprintf(key, BUFFER_LEN, "%05d", wcnt++); + dataShard.Insert(string(key, length), record); } Compilation and running commands are provided in the *Makefile.example*. @@ -159,403 +150,299 @@ After executing make create -, three data shards will created using the `create_shard.cc`, namely, -*rnnlm_word_shard_train*, *rnnlm_word_shard_test* and *rnnlm_word_shard_valid*. +, three data shards will created, namely, +*train_shard*, *test_shard* and *valid_shard*. ### Layer implementation -7 layers (i.e., Layer subclasses) are implemented for this application, -including 1 [data layer](http://singa.incubator.apache.org/docs/layer#data-layers) which fetches data records from data -shards, 2 [parser layers](http://singa.incubator.apache.org/docs/layer#parser-layers) which parses the input records, 3 neuron layers -which transforms the word features and 1 loss layer which computes the -objective loss. - -First, we illustrate the data shard and how to create it for this application. Then, we -discuss the configuration and functionality of layers. Finally, we introduce how -to configure a job and then run the training for your own model. - -Following the guide for implementing [new Layer subclasses](http://singa.incubator.apache.org/docs/layer#implementing-a-new-layer-subclass), -we extend the [LayerProto](http://singa.incubator.apache.org/api/classsinga_1_1LayerProto.html) -to include the configuration message of each user-defined layer as shown below -(5 out of the 7 layers have specific configurations), +6 user-defined layers are implemented for this application. 
+Following the guide for implementing [new Layer subclasses](layer#implementing-a-new-layer-subclass), +we extend the [LayerProto](../api/classsinga_1_1LayerProto.html) +to include the configuration messages of user-defined layers as shown below +(3 out of the 7 layers have specific configurations), - package singa; - import "common.proto"; // Record message for SINGA is defined import "job.proto"; // Layer message for SINGA is defined //For implementation of RNNLM application - extend LayerProto { - optional RnnlmComputationProto rnnlmcomputation_conf = 201; - optional RnnlmSigmoidProto rnnlmsigmoid_conf = 202; - optional RnnlmInnerproductProto rnnlminnerproduct_conf = 203; - optional RnnlmWordinputProto rnnlmwordinput_conf = 204; - optional RnnlmDataProto rnnlmdata_conf = 207; - } - - -In the subsequent sections, we describe the implementation of each layer, including -it configuration message. - -### RnnlmDataLayer - -It inherits [DataLayer](/api/classsinga_1_1DataLayer.html) for loading word and -class `Record`s from `DataShard`s into memory. - -#### Functionality - - void RnnlmDataLayer::Setup() { - read records from ClassShard to construct mapping from word string to class index - Resize length of records_ as window_size + 1 - Read 1st word record to the last position - } - - - void RnnlmDataLayer::ComputeFeature() { - records_[0] = records_[windowsize_]; //Copy the last record to 1st position in the record vector - Assign values to records_; //Read window_size new word records from WordShard - } - - -The `Steup` function load the mapping (from word string to class index) from -*ClassShard*. - -Every time the `ComputeFeature` function is called, it loads `windowsize_` records -from `WordShard`. - - -[For the consistency -of operations at each training iteration, it maintains a record vector (length -of window_size + 1). It reads the 1st record from the WordShard and puts it in -the last position of record vector]. - - -#### Configuration - - message RnnlmDataProto { - required string class_path = 1; // path to the class data file/folder, absolute or relative to the workspace - required string word_path = 2; // path to the word data file/folder, absolute or relative to the workspace - required int32 window_size = 3; // window size. - } - -[class_path to file or folder?] - -[There two paths, `class_path` for ...; `word_path` for.. -The `window_size` is set to ...] - - -### RnnlmWordParserLayer - -This layer gets `window_size` word strings from the `RnnlmDataLayer` and looks -up the word string to word index map to get word indexes. - -#### Functionality - - void RnnlmWordparserLayer::Setup(){ - Obtain window size from src layer; - Obtain vocabulary size from src layer; - Reshape data_ as {window_size}; - } - - void RnnlmWordparserLayer::ParseRecords(Blob* blob){ - for each word record in the window, get its word index and insert the index into blob - } - - -#### Configuration - -This layer does not have specific configuration fields. - - -### RnnlmClassParserLayer - -It maps each word in the processing window into a class index. 
- -#### Functionality - - void RnnlmClassparserLayer::Setup(){ - Obtain window size from src layer; - Obtain vocaubulary size from src layer; - Obtain class size from src layer; - Reshape data_ as {windowsize_, 4}; - } - - void RnnlmClassparserLayer::ParseRecords(){ - for(int i = 1; i < records.size(); i++){ - Copy starting word index in this class to data[i]'s 1st position; - Copy ending word index in this class to data[i]'s 2nd position; - Copy index of input word to data[i]'s 3rd position; - Copy class index of input word to data[i]'s 4th position; + extend singa.LayerProto { + optional EmbeddingProto embedding_conf = 101; + optional LossProto loss_conf = 102; + optional InputProto input_conf = 103; + } + +In the subsequent sections, we describe the implementation of each layer, +including its configuration message. + +#### RNNLayer + +This is the base layer of all other layers for this applications. It is defined +as follows, + + class RNNLayer : virtual public Layer { + public: + inline int window() { return window_; } + protected: + int window_; + }; + +For this application, two iterations may process different number of words. +Because sentences have different lengths. +The `DataLayer` decides the effective window size. All other layers call its source layers to get the +effective window size and resets `window_` in `ComputeFeature` function. + +#### DataLayer + +DataLayer is for loading Records. + + class DataLayer : public RNNLayer, singa::DataLayer { + public: + void Setup(const LayerProto& proto, int npartitions) override; + void ComputeFeature(int flag, Metric *perf) override; + int max_window() const { + return max_window_; } - } - -The setup function read - - -#### Configuration -This layer fetches the class information (the mapping information between -classes and words) from RnnlmDataLayer and maintains this information as data -in this layer. - - - -Next, this layer parses the last "window_size" number of word records from -RnnlmDataLayer and stores them as data. Then, it retrieves the corresponding -class for each input word. It stores the starting word index of this class, -ending word index of this class, word index and class index respectively. + private: + int max_window_; + singa::DataShard* shard_; + }; +The Setup function gets the user configured max window size. Since this application +predicts the next word for each input word, the record vector is resized to +have max_window+1 records, where the k-th record is loaded as the ground +truth label for the (k-1)-th record. -### RnnlmWordInputLayer + max_window_ = proto.GetExtension(input_conf).max_window(); + records_.resize(max_window_ + 1); -Using the input word records, this layer obtains corresponding word vectors as -its data. Then, it passes the data to RnnlmInnerProductLayer above for further -processing. +The `ComputeFeature` function loads at most max_window records. It could also +stop when the sentence ending character is encountered. -#### Configuration -In this layer, the length of each word vector needs to be configured. Besides, -whether to use bias term during the training process should also be configured -(See more in -[job.proto](https://github.com/kaiping/incubator-singa/blob/rnnlm/src/proto/job.proto)). 
- - message RnnlmWordinputProto { - required int32 word_length = 1; // vector length for each input word - optional bool bias_term = 30 [default = true]; // use bias vector or not + records_[0] = records_[window_]; // shift the last record to the first + window_ = max_window_; + for (int i = 1; i <= max_window_; i++) { + // load record; break if it is the ending character } -#### Functionality -In setup phase, this layer first reshapes its members such as "data", "grad", -and "weight" matrix. Then, it obtains the vocabulary size from its source layer -(i.e., RnnlmWordParserLayer). - -In the forward phase, using the "window_size" number of input word indices, the -"window_size" number of word vectors are selected from this layer's weight -matrix, each word index corresponding to one row. +The configuration of `DataLayer` is like - void RnnlmWordinputLayer::ComputeFeature() { - for(int t = 0; t < windowsize_; t++){ - data[t] = weight[src[t]]; - } + name: "data" + user_type: "kData" + [input_conf] { + path: "examples/rnnlm/train_shard" + max_window: 10 } -In the backward phase, after computing this layer's gradient in its destination -layer (i.e., RnnlmInnerProductLayer), here the gradient of the weight matrix in -this layer is copied (by row corresponding to word indices) from this layer's -gradient. - - void RnnlmWordinputLayer::ComputeGradient() { - for(int t = 0; t < windowsize_; t++){ - gweight[src[t]] = grad[t]; - } - } - - -### RnnlmInnerProductLayer +#### EmbeddingLayer -This is a neuron layer which receives the data from RnnlmWordInputLayer and -sends the computation results to RnnlmSigmoidLayer. +This layer gets records from `DataLayer`. For each record, the word index is +parsed and used to get the corresponding word feature vector from the embedding +matrix. -#### Configuration -In this layer, the number of neurons needs to be specified. Besides, whether to -use a bias term should also be configured. +The class is declared as follows, - message RnnlmInnerproductProto { - required int32 num_output = 1; //Number of outputs for the layer - optional bool bias_term = 30 [default = true]; //Use bias vector or not + class EmbeddingLayer : public RNNLayer { + ... + const std::vector<Param*> GetParams() const override { + std::vector<Param*> params{embed_}; + return params; + } + private: + int word_dim_, vocab_size_; + Param* embed_; + } + +The `embed_` field is a matrix whose values are parameter to be learned. +The matrix size is `vocab_size_` x `word_dim_`. + +The Setup function reads configurations for `word_dim_` and `vocab_size_`. Then +it allocates feature Blob for `max_window` words and setups `embed_`. + + int max_window = srclayers_[0]->data(this).shape()[0]; + word_dim_ = proto.GetExtension(embedding_conf).word_dim(); + data_.Reshape(vector<int>{max_window, word_dim_}); + ... + embed_->Setup(vector<int>{vocab_size_, word_dim_}); + +The `ComputeFeature` function simply copies the feature vector from the `embed_` +matrix into the feature Blob. + + # reset effective window size + window_ = datalayer->window(); + auto records = datalayer->records(); + ... + for (int t = 0; t < window_; t++) { + int idx = static_cast<int>(records[t].GetExtension(word).word_index()); + Copy(words[t], embed[idx]); + } + +The `ComputeGradient` function copies back the gradients to the `embed_` matrix. 
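A minimal sketch of that gradient copy is shown below, assuming the same variables as the `ComputeFeature` snippet above; `grad` (the gradient of this layer's output features) and `gembed` (the gradient blob of `embed_`) are placeholder names, not the exact SINGA source.

    void EmbeddingLayer::ComputeGradient() {  // sketch only
      // grad holds one gradient row per word in the window;
      // gembed is the gradient of embed_, assumed reset to zero beforehand
      for (int t = 0; t < window_; t++) {
        int idx = static_cast<int>(records[t].GetExtension(word).word_index());
        Copy(gembed[idx], grad[t]);  // copy row t back to the row of word idx
        // if a word occurs more than once in the window, its rows should be
        // accumulated rather than overwritten
      }
    }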
+ +The configuration for `EmbeddingLayer` is like, + + user_type: "kEmbedding" + [embedding_conf] { + word_dim: 15 + vocab_size: 3720 + } + srclayers: "data" + param { + name: "w1" + init { + type: kUniform + low:-0.3 + high:0.3 + } } -#### Functionality - -In the forward phase, this layer is in charge of executing the dot -multiplication between its weight matrix and the data in its source layer -(i.e., RnnlmWordInputLayer). +#### LabelLayer - void RnnlmInnerproductLayer::ComputeFeature() { - data = dot(src, weight); //Dot multiplication operation - } +Since the label of records[i] is records[i+1]. +This layer fetches the effective window records starting from position 1. +It converts each record into a tuple (word_class_start, word_class_end, word_index, class_index). -In the backward phase, this layer needs to first compute the gradient of its -source layer (i.e., RnnlmWordInputLayer). Then, it needs to compute the -gradient of its weight matrix by aggregating computation results for each -timestamp. The details can be seen as follows. - void RnnlmInnerproductLayer::ComputeGradient() { - for (int t = 0; t < windowsize_; t++) { - Add the dot product of src[t] and grad[t] to gweight; - } - Copy the dot product of grad and weight to gsrc; + for (int i = 0; i < window_; i++) { + WordRecord wordrecord = records[i + 1].GetExtension(word); + label[4 * i + 0] = wordrecord.class_start(); + label[4 * i + 1] = wordrecord.class_end(); + label[4 * i + 2] = wordrecord.word_index(); + label[4 * i + 3] = wordrecord.class_index(); } -### RnnlmSigmoidLayer +There is no special configuration for this layer. -This is a neuron layer for computation. During the computation in this layer, -each component of the member data specific to one timestamp uses its previous -timestamp's data component as part of the input. This is how the time-order -information is utilized in this language model application. +#### HiddenLayer -Besides, if you want to implement a recurrent neural network following our -design, this layer is of vital importance for you to refer to. Also, you can -always think of other design methods to make use of information from past -timestamps. +This layer unrolls the recurrent connections for at most max_window times. +The feature for position k is computed based on the feature from the embedding layer (position k) +and the feature at position k-1 of this layer. The formula is -#### Configuration +`$$f[k]=\sigma (f[t-1]*W+src[t])$$` -In this layer, whether to use a bias term needs to be specified. +where `$W$` is a matrix with `word_dim_` x `word_dim_` parameters. - message RnnlmSigmoidProto { - optional bool bias_term = 1 [default = true]; // use bias vector or not - } +If you want to implement a recurrent neural network following our +design, this layer is of vital importance for you to refer to. -#### Functionality - -In the forward phase, this layer first receives data from its source layer -(i.e., RnnlmInnerProductLayer) which is used as one part input for computation. -Then, for each timestampe this layer executes a dot multiplication between its -previous timestamp information and its own weight matrix. The results are the -other part for computation. This layer sums these two parts together and -executes an activation operation. The detailed descriptions for this process -are illustrated as follows. 
- - void RnnlmSigmoidLayer::ComputeFeature() { - for(int t = 0; t < window_size; t++){ - if(t == 0) Copy the sigmoid results of src[t] to data[t]; - else Compute the dot product of data[t - 1] and weight, and add sigmoid results of src[t] to be data[t]; - } + class HiddenLayer : public RNNLayer { + ... + const std::vector<Param*> GetParams() const override { + std::vector<Param*> params{weight_}; + return params; + } + private: + Param* weight_; + }; + +The `Setup` function setups the weight matrix as + + weight_->Setup(std::vector<int>{word_dim, word_dim}); + +The `ComputeFeature` function gets the effective window size (`window_`) from its source layer +i.e., the embedding layer. Then it propagates the feature from position 0 to position +`window_` -1. The detailed descriptions for this process are illustrated as follows. + + void HiddenLayer::ComputeFeature() { + for(int t = 0; t < window_size; t++){ + if(t == 0) + Copy(data[t], src[t]); + else + data[t]=sigmoid(data[t-1]*W + src[t]); + } } -In the backward phase, this RnnlmSigmoidLayer first updates this layer's member -grad utilizing the information from current timestamp's next timestamp. Then -respectively, this layer computes the gradient for its weight matrix and its -source layer RnnlmInnerProductLayer by iterating different timestamps. The -process can be seen below. - - void RnnlmSigmoidLayer::ComputeGradient(){ - Update grad[t]; // Update the gradient for the current layer, add a new term from next timestamp - for (int t = 0; t < windowsize_; t++) { - Update gweight; // Compute the gradient for the weight matrix - Compute gsrc[t]; // Compute the gradient for src layer +The `ComputeGradient` function computes the gradient of the loss w.r.t. W and the source layer. +Particularly, for each position k, since data[k] contributes to data[k+1] and the feature +at position k in its destination layer (the loss layer), grad[k] should contains the gradient +from two parts. The destination layer has already computed the gradient from the loss layer into +grad[k]; In the `ComputeGradient` function, we need to add the gradient from position k+1. + + void HiddenLayer::ComputeGradient(){ + ... + for (int k = window_ - 1; k >= 0; k--) { + if (k < window_ - 1) { + grad[k] += dot(grad[k + 1], weight.T()); // add gradient from position t+1. } + grad[k] =... // compute gL/gy[t], y[t]=data[t-1]*W+src[t] + } + gweight = dot(data.Slice(0, window_-1).T(), grad.Slice(1, window_)); + Copy(gsrc, grad); } +After the loop, we get the gradient of the loss w.r.t y[k], which is used to +compute the gradient of W and the src[k]. +#### LossLayer -### RnnlmComputationLayer - -This layer is a loss layer in which the performance metrics, both the -probability of predicting the next word correctly, and perplexity (PPL in -short) are computed. To be specific, this layer is composed of the class -information part and the word information part. Therefore, the computation can -be essentially divided into two parts by slicing this layer's weight matrix. +This layer computes the cross-entropy loss and the `$log_{10}P(w_{t+1}|w_t)$` (which +could be averaged over all words by users to get the PPL value). -#### Configuration +There are two configuration fields to be specified by users. -In this layer, it is needed to specify whether to use a bias term during -training. 
- - message RnnlmComputationProto { - optional bool bias_term = 1 [default = true]; // use bias vector or not + message LossProto { + optional int32 nclass = 1; + optional int32 vocab_size = 2; } +There are two weight matrices to be learned -#### Functionality - -In the forward phase, by using the two sliced weight matrices (one is for class -information, another is for the words in this class), this -RnnlmComputationLayer calculates the dot product between the source layer's -input and the sliced matrices. The results can be denoted as "y1" and "y2". -Then after a softmax function, for each input word, the probability -distribution of classes and the words in this classes are computed. The -activated results can be denoted as p1 and p2. Next, using the probability -distribution, the PPL value is computed. - - void RnnlmComputationLayer::ComputeFeature() { - Compute y1 and y2; - p1 = Softmax(y1); - p2 = Softmax(y2); - Compute perplexity value PPL; + class LossLayer : public RNNLayer { + ... + private: + Param* word_weight_, *class_weight_; } +The ComputeFeature function computes the two probabilities respectively. -In the backward phase, this layer executes the following three computation -operations. First, it computes the member gradient of the current layer by each -timestamp. Second, this layer computes the gradient of its own weight matrix by -aggregating calculated results from all timestamps. Third, it computes the -gradient of its source layer, RnnlmSigmoidLayer, timestamp-wise. +`$$P(C_{w_{t+1}}|w_t) = Softmax(w_t * class\_weight_)$$` +`$$P(w_{t+1}|C_{w_{t+1}}) = Softmax(w_t * word\_weight[class\_start:class\_end])$$` - void RnnlmComputationLayer::ComputeGradient(){ - Compute grad[t] for all timestamps; - Compute gweight by aggregating results computed in different timestamps; - Compute gsrc[t] for all timestamps; - } +`$w_t$` is the feature from the hidden layer for the k-th word, its ground truth +next word is `$w_{t+1}$`. The first equation computes the probability distribution over all +classes for the next word. The second equation computes the +probability distribution over the words in the ground truth class for the next word. +The ComputeGradient function computes the gradient of the source layer +(i.e., the hidden layer) and the two weight matrices. -## Updater Configuration +### Updater Configuration We employ kFixedStep type of the learning rate change method and the -configuration is as follows. We use different learning rate values in different -step ranges. [Here](http://wangwei-pc.d1.comp.nus.edu.sg:4000/docs/updater/) is -more information about choosing updaters. +configuration is as follows. We decay the learning rate once the performance does +not increase on the validation dataset. updater{ - #weight_decay:0.0000001 - lr_change: kFixedStep - type: kSGD + type: kSGD + learning_rate { + type: kFixedStep fixedstep_conf:{ step:0 - step:42810 - step:49945 - step:57080 - step:64215 + step:48810 + step:56945 + step:65080 + step:73215 step_lr:0.1 step_lr:0.05 step_lr:0.025 step_lr:0.0125 step_lr:0.00625 } + } } - -## TrainOneBatch() Function +### TrainOneBatch() Function We use BP (BackPropagation) algorithm to train the RNN model here. The corresponding configuration can be seen below. # In job.conf file - alg: kBackPropagation - -Refer to -[here](http://wangwei-pc.d1.comp.nus.edu.sg:4000/docs/train-one-batch/) for -more information on different TrainOneBatch() functions. 
- -## Cluster Configuration - -In this RNN language model, we configure the cluster topology as follows. - - cluster { - nworker_groups: 1 - nserver_groups: 1 - nservers_per_group: 1 - nworkers_per_group: 1 - nservers_per_procs: 1 - nworkers_per_procs: 1 - workspace: "examples/rnnlm/" + train_one_batch { + alg: kBackPropagation } -This is to train the model in one node. For other configuration choices, please -refer to [here](http://wangwei-pc.d1.comp.nus.edu.sg:4000/docs/frameworks/). - - -## Configure Job - -Job configuration is written in "job.conf". - -Note: Extended field names should be embraced with square-parenthesis [], e.g., [singa.rnnlmdata_conf]. - - -## Run Training - -Start training by the following commands - - cd SINGA_ROOT - ./bin/singa-run.sh -workspace=examples/rnnlm +### Cluster Configuration +The default cluster configuration can be used, i.e., single worker and single server +in a single process.
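For reference, an explicit single-worker, single-server cluster configuration is sketched below. It mirrors the `cluster` block from the earlier version of this page, so treat the exact field set as indicative rather than authoritative.

    cluster {
      nworker_groups: 1
      nserver_groups: 1
      nservers_per_group: 1
      nworkers_per_group: 1
      nservers_per_procs: 1
      nworkers_per_procs: 1
      workspace: "examples/rnnlm/"
    }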
Modified: incubator/singa/site/trunk/content/markdown/docs/updater.md URL: http://svn.apache.org/viewvc/incubator/singa/site/trunk/content/markdown/docs/updater.md?rev=1703880&r1=1703879&r2=1703880&view=diff ============================================================================== --- incubator/singa/site/trunk/content/markdown/docs/updater.md (original) +++ incubator/singa/site/trunk/content/markdown/docs/updater.md Fri Sep 18 15:10:58 2015 @@ -1,5 +1,7 @@ # Updater +--- + Every server in SINGA has an [Updater](api/classsinga_1_1Updater.html) instance that updates parameters based on gradients. In this page, the *Basic user guide* describes the configuration of an updater. @@ -33,7 +35,7 @@ Users need to configure at least the `le momentum: float weight_decay: float learning_rate { - + ... } } @@ -192,7 +194,7 @@ where step[k] is the smallest number tha ## Advanced user guide -### Implementing a new Update subclass +### Implementing a new Updater subclass The base Updater class has one virtual function, @@ -279,4 +281,4 @@ The configuration is then like, Users have to register this subclass in the main function, - driver.RegisterLRGenerator<FooLRGen>("FooLR") + driver.RegisterLRGenerator<FooLRGen, std::string>("FooLR") Modified: incubator/singa/site/trunk/content/markdown/index.md URL: http://svn.apache.org/viewvc/incubator/singa/site/trunk/content/markdown/index.md?rev=1703880&r1=1703879&r2=1703880&view=diff ============================================================================== --- incubator/singa/site/trunk/content/markdown/index.md (original) +++ incubator/singa/site/trunk/content/markdown/index.md Fri Sep 18 15:10:58 2015 @@ -1,28 +1,29 @@ ### Getting Started -* The [Introduction](http://singa.incubator.apache.org/docs/overview.html) page gives an overview of SINGA. +* The [Introduction](docs/overview.html) page gives an overview of SINGA. -* The [Installation](http://singa.incubator.apache.org/docs/installation.html) +* The [Installation](docs/installation.html) guide describes details on downloading and installing SINGA. -* Please follow the [Quick Start](http://singa.incubator.apache.org/docs/quick-start.html) +* Please follow the [Quick Start](docs/quick-start.html) guide to run simple applications on SINGA. ### Documentation -* Documentations are listed [here](http://singa.incubator.apache.org/docs.html). -* Code API can be found [here](http://singa.incubator.apache.org/api/index.html). -* Research publication list is available [here](http://singa.incubator.apache.org/research/publication). +* Documentations are listed [here](docs.html). +* Code API can be found [here](api/index.html). +* Research publication list is available [here](http://www.comp.nus.edu.sg/~dbsystem/singa//research/publication/). ### How to contribute * Please subscribe to our development mailing list [email protected]. * If you find any issues using SINGA, please report it to the [Issue Tracker](https://issues.apache.org/jira/browse/singa). -* You can also contact with [SINGA committers](http://singa.incubator.apache.org/dev/community) directly. +* You can also contact with [SINGA committers](dev/community) directly. More details on contributing to SINGA is described [here](dev/contribute). ### Recent News +* SINGA was presented in a [workshop on deep learning](http://www.comp.nus.edu.sg/~dbsystem/singa/workshop) held on 16 Sep, 2015 * SINGA will be presented at [BOSS](http://boss.dima.tu-berlin.de/) of [VLDB 2015](http://www.vldb.org/2015/) at Hawai'i, 4 Sep, 2015. 
(slides: [overview](files/singa-vldb-boss.pptx), @@ -51,12 +52,11 @@ Please cite the following two papers if * B. C. Ooi, K.-L. Tan, S. Wang, W. Wang, Q. Cai, G. Chen, J. Gao, Z. Luo, A. K. H. Tung, Y. Wang, Z. Xie, M. Zhang, and K. Zheng. [SINGA: A distributed deep learning platform](http://www.comp.nus.edu.sg/~ooibc/singaopen-mm15.pdf). ACM Multimedia - (Open Source Software Competition) 2015 ([BibTex](http://singa.incubator.apache.org/assets/file/bib-oss.txt)). + (Open Source Software Competition) 2015 ([BibTex](http://www.comp.nus.edu.sg/~dbsystem/singa//assets/file/bib-oss.txt)). * W. Wang, G. Chen, T. T. A. Dinh, B. C. Ooi, K.-L.Tan, J. Gao, and S. Wang. [SINGA:putting deep learning in the hands of multimedia users](http://www.comp.nus.edu.sg/~ooibc/singa-mm15.pdf). -ACM Multimedia 2015 ([BibTex](http://singa.incubator.apache.org/assets/file/bib-singa.txt)). +ACM Multimedia 2015 ([BibTex](http://www.comp.nus.edu.sg/~dbsystem/singa//assets/file/bib-singa.txt)). ### License SINGA is released under [Apache License Version 2.0](http://www.apache.org/licenses/LICENSE-2.0). - Modified: incubator/singa/site/trunk/content/site.xml URL: http://svn.apache.org/viewvc/incubator/singa/site/trunk/content/site.xml?rev=1703880&r1=1703879&r2=1703880&view=diff ============================================================================== --- incubator/singa/site/trunk/content/site.xml (original) +++ incubator/singa/site/trunk/content/site.xml Fri Sep 18 15:10:58 2015 @@ -51,6 +51,15 @@ <item name="Welcome" href="index.html"/> </menu> + <head> + <script type="text/javascript" + src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"> + </script> + <script type="text/x-mathjax-config"> + MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}}); + </script> + </head> + <menu name="Documentaion"> <item name="Introduction" href="docs/overview.html"/> <item name="Installation" href="docs/installation.html"/>
