Added: incubator/singa/site/trunk/content/markdown/v0.3.0/zh/neural-net.md
URL: http://svn.apache.org/viewvc/incubator/singa/site/trunk/content/markdown/v0.3.0/zh/neural-net.md?rev=1740048&view=auto
==============================================================================
--- incubator/singa/site/trunk/content/markdown/v0.3.0/zh/neural-net.md (added)
+++ incubator/singa/site/trunk/content/markdown/v0.3.0/zh/neural-net.md Wed Apr 20 05:09:06 2016
@@ -0,0 +1,327 @@
# Neural Net

---

`NeuralNet` in SINGA represents an instance of a user's neural net model. As a
neural net typically consists of a set of layers, `NeuralNet` comprises a set
of unidirectionally connected [Layer](layer.html)s. This page describes how to
convert a user's neural net into the configuration of `NeuralNet`.

<img src="../images/model-category.png" align="center" width="200px"/>
<span><strong>Figure 1 - Categorization of popular deep learning models.</strong></span>

## Net structure configuration

Users configure the `NeuralNet` by listing all layers of the neural net and
specifying each layer's source layer names. Popular deep learning models can be
categorized as shown in Figure 1. The subsequent sections give details for each
category.

### Feed-forward models

<div align = "left">
<img src="../images/mlp-net.png" align="center" width="200px"/>
<span><strong>Figure 2 - Net structure of a MLP model.</strong></span>
</div>

Feed-forward models, e.g., CNN and MLP, are easy to configure, as their layer
connections are directed and contain no cycles. The configuration for the MLP
model shown in Figure 2 is as follows,

    net {
      layer {
        name : "data"
        type : kData
      }
      layer {
        name : "image"
        type : kImage
        srclayer: "data"
      }
      layer {
        name : "label"
        type : kLabel
        srclayer: "data"
      }
      layer {
        name : "hidden"
        type : kHidden
        srclayer: "image"
      }
      layer {
        name : "softmax"
        type : kSoftmaxLoss
        srclayer: "hidden"
        srclayer: "label"
      }
    }

### Energy models

<img src="../images/rbm-rnn.png" align="center" width="500px"/>
<span><strong>Figure 3 - Convert connections in RBM and RNN.</strong></span>

For energy models, including RBM, DBM, etc., the connections are undirected
(i.e., Category B). To represent these models using `NeuralNet`, users can
simply replace each connection with two directed connections, as shown in
Figure 3a. In other words, for each pair of connected layers, their source
layer fields should include each other's names. The full
[RBM example](rbm.html) has the detailed neural net configuration for an RBM
model, which looks like

    net {
      layer {
        name : "vis"
        type : kVisLayer
        param {
          name : "w1"
        }
        srclayer: "hid"
      }
      layer {
        name : "hid"
        type : kHidLayer
        param {
          name : "w2"
          share_from: "w1"
        }
        srclayer: "vis"
      }
    }

### RNN models

For recurrent neural networks (RNN), users can remove the recurrent connections
by unrolling the recurrent layer. For example, in Figure 3b, the original layer
is unrolled into a new layer with 4 internal layers. In this way, the model
becomes a normal feed-forward model and thus can be configured similarly. The
[RNN example](rnn.html) has a full neural net configuration for an RNN model.
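To make the unrolling in Figure 3b concrete, the computation of a recurrent
layer unrolled over a window of 4 positions can be written as below. The
symbols are chosen for illustration only: `$f[k]$` is the feature of the k-th
internal layer, `$src[k]$` the input feature at position k, and `$W$` the
recurrent weight matrix shared by all internal layers (this mirrors the
`HiddenLayer` formula used in the [RNN example](rnn.html)).

`$$f[0]=src[0], \qquad f[k]=\sigma(f[k-1] W + src[k]), \quad k=1,2,3$$`

Each unrolled position becomes one internal layer whose source layers are the
corresponding input layer and the internal layer at the previous position.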
## Configuration for multiple nets

Typically, a training job includes three neural nets for the training,
validation and test phase respectively. The three neural nets share most layers
except the data layer, loss layer or output layer, etc. To avoid redundant
configurations for the shared layers, users can use the `exclude` field to
filter a layer out of a neural net; e.g., the following layer will be filtered
out when creating the test `NeuralNet`.

    layer {
      ...
      exclude : kTest # filter this layer for creating test net
    }

## Neural net partitioning

A neural net can be partitioned in different ways to distribute the training
over multiple workers.

### Batch and feature dimension

<img src="../images/partition_fc.png" align="center" width="400px"/>
<span><strong>Figure 4 - Partitioning of a fully connected layer.</strong></span>

Every layer's feature blob is considered a matrix whose rows are feature
vectors. Thus, one layer can be split along two dimensions. Partitioning on
dimension 0 (also called the batch dimension) slices the feature matrix by
rows. For instance, if the mini-batch size is 256 and the layer is partitioned
into 2 sub-layers, each sub-layer would have 128 feature vectors in its feature
blob. Partitioning on this dimension has no effect on the parameters, as every
[Param](param.html) object is replicated in the sub-layers. Partitioning on
dimension 1 (also called the feature dimension) slices the feature matrix by
columns. For example, suppose the original feature vector has 50 units; after
partitioning into 2 sub-layers, each sub-layer would have 25 units. This
partitioning may result in [Param](param.html) objects being split, as shown in
Figure 4: both the bias vector and the weight matrix are split across the two
sub-layers. The sketch after the configuration list below illustrates the
resulting sub-layer shapes.

### Partitioning configuration

There are 4 partitioning schemes, whose configurations are given below,

 1. Partitioning each single layer into sub-layers on the batch dimension (see
 below). It is enabled by configuring the partition dimension of the layer to
 0, e.g.,

        # with other fields omitted
        layer {
          partition_dim: 0
        }

 2. Partitioning each single layer into sub-layers on the feature dimension
 (see below). It is enabled by configuring the partition dimension of the layer
 to 1, e.g.,

        # with other fields omitted
        layer {
          partition_dim: 1
        }

 3. Partitioning all layers into different subsets. It is enabled by
 configuring the location ID of a layer, e.g.,

        # with other fields omitted
        layer {
          location: 1
        }
        layer {
          location: 0
        }

 4. Hybrid partitioning of strategies 1, 2 and 3. Hybrid partitioning is useful
 for large models. An example application is to implement the
 [idea proposed by Alex](http://arxiv.org/abs/1404.5997).
 Hybrid partitioning is configured like,

        # with other fields omitted
        layer {
          location: 1
        }
        layer {
          location: 0
        }
        layer {
          partition_dim: 0
          location: 0
        }
        layer {
          partition_dim: 1
          location: 0
        }

Currently SINGA supports strategy-2 well. Other partitioning strategies are
under test and will be released in a later version.
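Referring back to Figure 4, the sketch below computes the per-sub-layer feature
shape for the two single-layer schemes. `SubShape` and its arguments are made
up for this illustration and are not part of the SINGA API.

    #include <cassert>
    #include <utility>

    // Hypothetical helper: shape (rows, cols) of one sub-layer after splitting a
    // feature blob with `batch` rows and `dim` columns into `npartitions` pieces.
    std::pair<int, int> SubShape(int batch, int dim, int npartitions, int partition_dim) {
      assert(partition_dim == 0 || partition_dim == 1);
      if (partition_dim == 0)               // batch dimension: slice rows
        return {batch / npartitions, dim};  // e.g., 256 x 50 -> 128 x 50
      return {batch, dim / npartitions};    // feature dimension: slice columns, e.g., 256 x 50 -> 256 x 25
    }

Only scheme 2 splits the weight matrix and bias of a layer; scheme 1 leaves
every Param replicated, as stated above.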
## Parameter sharing

Parameters can be shared in the following cases,

  * sharing parameters among layers via user configuration. For example, the
  visible layer and hidden layer of an RBM share the weight matrix, which is
  configured through the `share_from` field as shown in the above RBM
  configuration. The configurations must be the same (except the name) for
  shared parameters.

  * due to neural net partitioning, some `Param` objects are replicated into
  different workers, e.g., when partitioning one layer on the batch dimension.
  These workers share the parameter values. SINGA controls this kind of
  parameter sharing automatically; users do not need to do any configuration.

  * the `NeuralNet` for training and testing (and validation) share most
  layers, and thus share `Param` values.

If the shared `Param` instances reside in the same process (possibly in
different threads), they use the same chunk of memory for their values, but
they have separate memory for their gradients. In fact, their gradients will be
averaged by the stub or server.

## Advanced user guide

### Creation

    static NeuralNet* NeuralNet::Create(const NetProto& np, Phase phase, int num);

The above function creates a `NeuralNet` for a given phase and returns a
pointer to the `NeuralNet` instance. The phase is in {kTrain, kValidation,
kTest}. `num` is used for net partitioning; it indicates the number of
partitions. Typically, a training job includes three neural nets for the
training, validation and test phase respectively. The three neural nets share
most layers except the data layer, loss layer or output layer, etc. The
`Create` function takes in the full net configuration, including layers for
training, validation and test. It removes layers for phases other than the
specified phase based on the `exclude` field in
[layer configuration](layer.html):

    layer {
      ...
      exclude : kTest # filter this layer for creating test net
    }

The filtered net configuration is passed to the constructor of `NeuralNet`:

    NeuralNet::NeuralNet(NetProto netproto, int npartitions);

The constructor first creates a graph representing the net structure in

    Graph* NeuralNet::CreateGraph(const NetProto& netproto, int npartitions);

Next, it creates a layer for each node and connects layers if their nodes are
connected.

    void NeuralNet::CreateNetFromGraph(Graph* graph, int npartitions);

Since the `NeuralNet` instance may be shared among multiple workers, the
`Create` function returns a pointer to the `NeuralNet` instance.

### Parameter sharing

`Param` sharing is enabled by first sharing the Param configuration (in
`NeuralNet::Create`) to create two similar (e.g., the same shape) Param
objects, and then calling (in `NeuralNet::CreateNetFromGraph`),

    void Param::ShareFrom(const Param& from);

It is also possible to share `Param`s of two nets, e.g., sharing parameters of
the training net and the test net,

    void NeuralNet::ShareParamsFrom(NeuralNet* other);

It will call `Param::ShareFrom` for each Param object.

### Access functions

`NeuralNet` provides a couple of access functions to get the layers and params
of the net:

    const std::vector<Layer*>& layers() const;
    const std::vector<Param*>& params() const;
    Layer* name2layer(string name) const;
    Param* paramid2param(int id) const;
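A minimal usage sketch of the functions above, creating the training and test
nets and sharing parameters between them. The surrounding context (the parsed
`net_conf` variable, error handling, and the worker setup that normally drives
these calls) is simplified for illustration; the layer name and param id are
taken from the MLP example only as placeholders.

    // Sketch only: net_conf is a NetProto parsed from the job configuration.
    NeuralNet* train_net = NeuralNet::Create(net_conf, kTrain, 1);
    NeuralNet* test_net  = NeuralNet::Create(net_conf, kTest, 1);

    // Let the test net reuse the Param values of the training net;
    // Param::ShareFrom is called internally for each Param object.
    test_net->ShareParamsFrom(train_net);

    // Look up a layer or a parameter when needed.
    Layer* hidden = train_net->name2layer("hidden");
    Param* param0 = train_net->paramid2param(0);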
### Partitioning

#### Implementation

SINGA partitions the neural net in the `CreateGraph` function, which creates
one node for each (partitioned) layer. For example, if one layer's partition
dimension is 0 or 1, then `npartitions` nodes are created for it; if the
partition dimension is -1, a single node is created, i.e., no partitioning.
Each node is assigned a partition (or location) ID. If the original layer is
configured with a location ID, then that ID is assigned to each newly created
node. These nodes are connected according to the connections of the original
layers. Some connection layers will be added automatically. For instance, if
two connected sub-layers are located at two different workers, then a pair of
bridge layers is inserted to transfer the feature (and gradient) blob between
them. When two layers are partitioned on different dimensions, a concatenation
layer which concatenates feature rows (or columns) and a slice layer which
slices feature rows (or columns) will be inserted. These connection layers help
make the network communication and synchronization transparent to the users.

#### Dispatching partitions to workers

Each (partitioned) layer is assigned a location ID, based on which it is
dispatched to one worker. In particular, the pointer to the `NeuralNet`
instance is passed to every worker within the same group, but each worker only
computes over the layers that have the same partition (or location) ID as the
worker's ID. When every worker computes the gradients of the entire set of
model parameters (strategy-1), we refer to this process as data parallelism.
When different workers compute the gradients of different parameters
(strategy-2 or strategy-3), we call this process model parallelism. Hybrid
partitioning leads to hybrid parallelism, where some workers compute the
gradients of the same subset of model parameters while other workers compute on
different model parameters. For example, to implement hybrid parallelism for
the [DCNN model](http://arxiv.org/abs/1404.5997), we set `partition_dim = 0`
for lower layers and `partition_dim = 1` for higher layers.
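The dispatching rule above amounts to a simple filter inside each worker's loop
over the shared net. The sketch below is illustrative pseudo-C++ rather than
SINGA's actual worker code; the `worker_id` argument and the `partition_id()`
accessor are assumed names for this example.

    // Illustrative only: each worker walks the shared NeuralNet but touches
    // only the layers whose partition/location ID matches its own ID.
    void ForwardLocalLayers(NeuralNet* net, int worker_id) {
      for (Layer* layer : net->layers()) {
        if (layer->partition_id() != worker_id)
          continue;  // this partition is dispatched to another worker
        // ... ComputeFeature (and later ComputeGradient) calls go here ...
      }
    }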
Added: incubator/singa/site/trunk/content/markdown/v0.3.0/zh/overview.md
URL: http://svn.apache.org/viewvc/incubator/singa/site/trunk/content/markdown/v0.3.0/zh/overview.md?rev=1740048&view=auto
==============================================================================
Binary file - no diff available.

Propchange: incubator/singa/site/trunk/content/markdown/v0.3.0/zh/overview.md
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: incubator/singa/site/trunk/content/markdown/v0.3.0/zh/programming-guide.md
URL: http://svn.apache.org/viewvc/incubator/singa/site/trunk/content/markdown/v0.3.0/zh/programming-guide.md?rev=1740048&view=auto
==============================================================================
--- incubator/singa/site/trunk/content/markdown/v0.3.0/zh/programming-guide.md (added)
+++ incubator/singa/site/trunk/content/markdown/v0.3.0/zh/programming-guide.md Wed Apr 20 05:09:06 2016
@@ -0,0 +1,67 @@
# Programming Guide

---

To submit a training job, users need to provide the configuration of the four
components shown in Figure 1:

  * a [NeuralNet](neural-net.html) describing the neural net structure,
  including the detailed settings of every layer and the connections between
  layers;
  * a [TrainOneBatch](train-one-batch.html) algorithm, which is customized for
  different model categories;
  * an [Updater](updater.html) defining the protocol for updating parameters on
  the server side;
  * a [Cluster Topology](distributed-training.html) specifying the distributed
  architecture of workers and servers.

The *Basic user guide* introduces how to submit a training job using built-in
components, while the *Advanced user guide* describes in detail how to write a
user's own main function and register user-implemented components. In addition,
advanced users and basic users [process](data.html) the training dataset in the
same way.

<img src="../../images/overview.png" align="center" width="400px"/>
<span><strong>Figure 1 - SINGA overview.</strong></span>

## Basic user guide

Users can submit a training job using the main function provided by SINGA. In
this case, users must provide, on the command line, a job configuration file
written according to [JobProto](../api/classsinga_1_1JobProto.html):

    ./bin/singa-run.sh -conf <path to job conf> [-resume]

`-resume` means continuing the training from the last
[checkpoint](checkpoint.html). The [MLP](mlp.html) and [CNN](cnn.html) models
submit their training jobs using built-in layers. Please read the corresponding
pages for their job configuration files; those pages explain the configuration
of every component in detail.
## Advanced user guide

If a user's model contains some self-defined components, e.g., an
[Updater](updater.html), the user must write his/her own main function to
register these components, similar to the main function of a Hadoop job. In
general, the main function should

  * initialize SINGA, e.g., set up the logging;
  * register the user-defined components;
  * create the job configuration and pass it to the SINGA driver.

An example main function looks like

    #include "singa.h"
    #include "user.h"  // header for user code

    int main(int argc, char** argv) {
      singa::Driver driver;
      driver.Init(argc, argv);
      bool resume;
      // parse resume option from argv.

      // register user defined layers
      driver.RegisterLayer<FooLayer>(kFooLayer);
      // register user defined updater
      driver.RegisterUpdater<FooUpdater>(kFooUpdater);
      ...
      auto jobConf = driver.job_conf();
      // update jobConf

      driver.Train(resume, jobConf);
      return 0;
    }

The driver's `Init` method loads the job configuration file provided by the
user through the command line argument (`-conf <job conf>`), which must contain
at least the cluster topology, and returns `jobConf` to the user, who can then
update or add configurations for the neural net, the Updater, etc. If
subclasses of Layer, Updater, Worker or Param are defined, users need to
register them through the driver. Finally, the job configuration is submitted
to the driver, which starts the training.

In the future we will provide helper tools similar to
[keras](https://github.com/fchollet/keras) to make the job configuration
simpler.

Users need to compile and link their own code (e.g., the layer implementations
and the main function) with the SINGA library (*.libs/libsinga.so*) to get an
executable file, e.g., named *mysinga*. To launch the program, users pass the
paths of both *mysinga* and the job configuration file to *./bin/singa-run.sh*:

    ./bin/singa-run.sh -conf <path to job conf> -exec <path to mysinga> [other arguments]

The [RNN application](rnn.html) provides a complete example of implementing the
main function to train a specific RNN model.
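For reference, a user-defined component registered in the main function above
can be as small as the following skeleton. `FooLayer`, `kFooLayer` and the
empty function bodies are hypothetical and only indicate the kind of subclass
the driver expects; the virtual functions to override are described in the
[layer](layer.html) page.

    // Hypothetical user-defined layer; kFooLayer is a user-chosen integer ID
    // that must not clash with the built-in layer type IDs.
    const int kFooLayer = 200;

    class FooLayer : public singa::Layer {
     public:
      void Setup(const LayerProto& proto, const vector<Layer*>& srclayers) override {
        // allocate the feature/gradient blobs based on the source layers
      }
      void ComputeFeature(int flag, const vector<Layer*>& srclayers) override {
        // forward pass
      }
      void ComputeGradient(int flag, const vector<Layer*>& srclayers) override {
        // backward pass
      }
    };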
Added: incubator/singa/site/trunk/content/markdown/v0.3.0/zh/rnn.md
URL: http://svn.apache.org/viewvc/incubator/singa/site/trunk/content/markdown/v0.3.0/zh/rnn.md?rev=1740048&view=auto
==============================================================================
--- incubator/singa/site/trunk/content/markdown/v0.3.0/zh/rnn.md (added)
+++ incubator/singa/site/trunk/content/markdown/v0.3.0/zh/rnn.md Wed Apr 20 05:09:06 2016
@@ -0,0 +1,420 @@
# Recurrent Neural Networks for Language Modelling

---

Recurrent Neural Networks (RNN) are widely used for modelling sequential data,
such as music and sentences. In this example, we use SINGA to train an
[RNN model](http://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf)
proposed by Tomas Mikolov for [language modeling](https://en.wikipedia.org/wiki/Language_model).
The training objective (loss) is to minimize the
[perplexity per word](https://en.wikipedia.org/wiki/Perplexity), which is
equivalent to maximizing the probability of predicting the next word given the
current word in a sentence.

Different from the [CNN](cnn.html), [MLP](mlp.html) and [RBM](rbm.html)
examples, which use built-in [layers](layer.html) and [records](data.html),
none of the layers in this example are built-in. Hence users can learn how to
implement their own layers and data records through this example.

## Running instructions

In *SINGA_ROOT/examples/rnnlm/*, scripts are provided to run the training job.
First, the data is prepared by

    $ cp Makefile.example Makefile
    $ make download
    $ make create

Second, to compile the source code under *examples/rnnlm/*, run

    $ make rnnlm

An executable file *rnnlm.bin* will be generated.

Third, the training is started by passing *rnnlm.bin* and the job configuration
to *singa-run.sh*,

    # at SINGA_ROOT/
    # export LD_LIBRARY_PATH=.libs:$LD_LIBRARY_PATH
    $ ./bin/singa-run.sh -exec examples/rnnlm/rnnlm.bin -conf examples/rnnlm/job.conf

## Implementations

<img src="../images/rnnlm.png" align="center" width="400px"/>
<span><strong>Figure 1 - Net structure of the RNN model.</strong></span>

The neural net structure is shown in Figure 1. Word records are loaded by the
`DataLayer`. In every iteration, at most `max_window` word records are
processed. If a sentence-ending character is read, the `DataLayer` stops
loading immediately. The `EmbeddingLayer` looks up a word embedding matrix to
extract feature vectors for the words loaded by the `DataLayer`. These features
are transformed by the `HiddenLayer`, which propagates them from left to right:
the output feature for the word at position k is influenced by the words at
positions 0 to k-1. Finally, the `LossLayer` computes the cross-entropy loss
(see below) by predicting the next word of each word. The cross-entropy loss is
computed as

`$$L(w_t)=-log P(w_{t+1}|w_t)$$`

Given `$w_t$`, the above equation would be computed over all words in the
vocabulary, which is time consuming. The
[RNNLM Toolkit](https://f25ea9ccb7d3346ce6891573d543960492b92c30.googledrive.com/host/0ByxdPXuxLPS5RFM5dVNvWVhTd0U/rnnlm-0.4b.tgz)
accelerates the computation as

`$$P(w_{t+1}|w_t) = P(C_{w_{t+1}}|w_t) * P(w_{t+1}|C_{w_{t+1}})$$`

Words from the vocabulary are partitioned into a user-defined number of
classes. The first term on the right-hand side predicts the class of the next
word; the second term then predicts the next word given its class. Both the
number of classes and the number of words in each class are much smaller than
the vocabulary size, so the probabilities can be calculated much faster.

The perplexity per word is computed by,

`$$PPL = 10^{- avg_t log_{10} P(w_{t+1}|w_t)}$$`
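As a rough sense of the speed-up from the class factorization: with the dataset
used below (3720 unique words) and 100 classes, a direct softmax normalizes
over all 3720 words for every prediction, whereas the factorized form
normalizes over the 100 classes plus the words of a single class (about 37 on
average), i.e. roughly

`$$3720 \quad vs. \quad 100 + 3720/100 \approx 137$$`

terms per prediction. The exact per-class count varies, because the classes are
built from frequency buckets rather than equal-sized splits.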
### Data preparation

We use a small dataset provided by the
[RNNLM Toolkit](https://f25ea9ccb7d3346ce6891573d543960492b92c30.googledrive.com/host/0ByxdPXuxLPS5RFM5dVNvWVhTd0U/rnnlm-0.4b.tgz).
It has 10,000 training sentences, with 71,350 words in total and 3,720 unique
words. The subsequent steps follow the instructions in
[Data Preparation](data.html) to convert the raw data into records and insert
them into data stores.

#### Download source data

    # in SINGA_ROOT/examples/rnnlm/
    cp Makefile.example Makefile
    make download

#### Define record format

We define the word record as follows,

    # in SINGA_ROOT/examples/rnnlm/rnnlm.proto
    message WordRecord {
      optional string word = 1;
      optional int32 word_index = 2;
      optional int32 class_index = 3;
      optional int32 class_start = 4;
      optional int32 class_end = 5;
    }

It includes the word string and its index in the vocabulary. Words in the
vocabulary are sorted based on their frequency in the training dataset. The
sorted list is cut into 100 sublists such that each sublist accounts for 1/100
of the total word frequency. Each sublist is called a class. Hence each word
has a `class_index` (in [0, 100)). `class_start` is the index of the first word
in the same class as `word`. `class_end` is the index of the first word in the
next class.

#### Create data stores

We use code from the RNNLM Toolkit to read words and to sort them into classes.
The main function in *create_store.cc* first creates the word classes based on
the training dataset. Second, it calls the following function to create a data
store for each of the training, validation and test datasets.

    int create_data(const char *input_file, const char *output_file);

`input_file` is the path to a training/validation/test text file from the RNNLM
Toolkit, and `output_file` is the output store file. This function starts with

    singa::io::KVFile store;
    store.Open(output, singa::io::kCreate);

Then it reads the words one by one. For each word it creates a `WordRecord`
instance and inserts it into the store,

    int wcnt = 0; // word count
    WordRecord wordRecord;
    while(1) {
      readWord(wordstr, fin);
      if (feof(fin)) break;
      ...// fill in the wordRecord;
      string val;
      wordRecord.SerializeToString(&val);
      int length = snprintf(key, BUFFER_LEN, "%05d", wcnt++);
      store.Write(string(key, length), val);
    }

Compilation and running commands are provided in *Makefile.example*. After
executing

    make create

*train_data.bin*, *test_data.bin* and *valid_data.bin* will be created.

### Layer implementation

4 user-defined layers are implemented for this application. Following the guide
for implementing [new Layer subclasses](layer#implementing-a-new-layer-subclass),
we extend the [LayerProto](../api/classsinga_1_1LayerProto.html) to include the
configuration messages of the user-defined layers as shown below (3 of them
have specific configurations),

    import "job.proto";     // Layer message for SINGA is defined

    //For implementation of RNNLM application
    extend singa.LayerProto {
      optional EmbeddingProto embedding_conf = 101;
      optional LossProto loss_conf = 102;
      optional DataProto data_conf = 103;
    }

In the subsequent sections, we describe the implementation of each layer,
including its configuration message.

#### RNNLayer

This is the base layer of all other layers in this application. It is defined
as follows,

    class RNNLayer : virtual public Layer {
     public:
      inline int window() { return window_; }
     protected:
      int window_;
    };

Two iterations may process different numbers of words, because sentences have
different lengths. The `DataLayer` decides the effective window size; all other
layers query their source layers for the effective window size and reset
`window_` in their `ComputeFeature` functions.

#### DataLayer

DataLayer is for loading word records.

    class DataLayer : public RNNLayer, singa::InputLayer {
     public:
      void Setup(const LayerProto& proto, const vector<Layer*>& srclayers) override;
      void ComputeFeature(int flag, const vector<Layer*>& srclayers) override;
      int max_window() const {
        return max_window_;
      }
     private:
      int max_window_;
      singa::io::Store* store_;
    };

The `Setup` function gets the user-configured maximum window size,

    max_window_ = proto.GetExtension(input_conf).max_window();

The `ComputeFeature` function loads at most `max_window_` records. It stops
early when the sentence-ending character is encountered,

    ...// shift the last record to the first
    window_ = max_window_;
    for (int i = 1; i <= max_window_; i++) {
      // load record; break if it is the ending character
    }

The configuration of `DataLayer` is like

    name: "data"
    user_type: "kData"
    [data_conf] {
      path: "examples/rnnlm/train_data.bin"
      max_window: 10
    }
#### EmbeddingLayer

This layer gets records from the `DataLayer`. For each record, the word index
is parsed and used to get the corresponding word feature vector from the
embedding matrix.

The class is declared as follows,

    class EmbeddingLayer : public RNNLayer {
      ...
      const std::vector<Param*> GetParams() const override {
        std::vector<Param*> params{embed_};
        return params;
      }
     private:
      int word_dim_, vocab_size_;
      Param* embed_;
    }

The `embed_` field points to a matrix whose values are the parameters to be
learned. The matrix size is `vocab_size_` x `word_dim_`.

The `Setup` function reads the configuration for `word_dim_` and `vocab_size_`.
Then it allocates the feature Blob for `max_window` words and sets up `embed_`,

    int max_window = srclayers[0]->data(this).shape()[0];
    word_dim_ = proto.GetExtension(embedding_conf).word_dim();
    data_.Reshape(vector<int>{max_window, word_dim_});
    ...
    embed_->Setup(vector<int>{vocab_size_, word_dim_});

The `ComputeFeature` function simply copies the feature vector from the
`embed_` matrix into the feature Blob,

    # reset effective window size
    window_ = datalayer->window();
    auto records = datalayer->records();
    ...
    for (int t = 0; t < window_; t++) {
      int idx <- word index
      Copy(words[t], embed[idx]);
    }

The `ComputeGradient` function copies the gradients back to the `embed_`
matrix.

The configuration for `EmbeddingLayer` is like,

    user_type: "kEmbedding"
    [embedding_conf] {
      word_dim: 15
      vocab_size: 3720
    }
    srclayers: "data"
    param {
      name: "w1"
      init {
        type: kUniform
        low:-0.3
        high:0.3
      }
    }

#### HiddenLayer

This layer unrolls the recurrent connections for at most `max_window` times.
The feature at position k is computed from the feature produced by the
embedding layer at position k and the feature of this layer at position k-1.
The formula is

`$$f[k]=\sigma (f[k-1]*W+src[k])$$`

where `$W$` is a `word_dim_` x `word_dim_` parameter matrix. If you want to
implement a recurrent neural network following our design, this is the most
important layer to refer to.

    class HiddenLayer : public RNNLayer {
      ...
      const std::vector<Param*> GetParams() const override {
        std::vector<Param*> params{weight_};
        return params;
      }
     private:
      Param* weight_;
    };

The `Setup` function sets up the weight matrix as

    weight_->Setup(std::vector<int>{word_dim, word_dim});

The `ComputeFeature` function gets the effective window size (`window_`) from
its source layer, i.e., the embedding layer. Then it propagates the feature
from position 0 to position `window_` - 1, as sketched below.

    void HiddenLayer::ComputeFeature() {
      for (int k = 0; k < window_; k++) {
        if (k == 0)
          Copy(data[k], src[k]);
        else
          data[k] = sigmoid(data[k-1] * W + src[k]);
      }
    }

The `ComputeGradient` function computes the gradient of the loss w.r.t. W and
w.r.t. the source layer. In particular, for each position k, since data[k]
contributes to data[k+1] and to the feature at position k in its destination
layer (the loss layer), grad[k] should contain the gradients from both parts.
The destination layer has already accumulated the gradient from the loss layer
into grad[k]; in the `ComputeGradient` function, we need to add the gradient
from position k+1,

    void HiddenLayer::ComputeGradient() {
      ...
      for (int k = window_ - 1; k >= 0; k--) {
        if (k < window_ - 1) {
          grad[k] += dot(grad[k + 1], weight.T()); // add gradient from position k+1.
        }
        grad[k] = ... // compute gL/gy[k], where y[k] = data[k-1] * W + src[k]
      }
      gweight = dot(data.Slice(0, window_-1).T(), grad.Slice(1, window_));
      Copy(gsrc, grad);
    }

After the loop, we get the gradient of the loss w.r.t. y[k], which is used to
compute the gradients of W and src[k].
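For completeness, the backward recurrence that the loop above implements can be
written out explicitly. Let n be the effective window size (`window_`), write
`$y[k]=f[k-1]W+src[k]$` for `$k\geq 1$`, let `$g[k]$` denote
`$\partial L/\partial y[k]$`, and let `$g_{loss}[k]$` be the gradient already
filled into grad[k] by the loss layer; this is one consistent way to state it,
not a quote of SINGA's code.

`$$g[k] = \Big(g_{loss}[k] + g[k+1] W^T\Big)\odot \sigma'(y[k]), \qquad g[n]=0$$`

`$$\frac{\partial L}{\partial W} = \sum_{k=1}^{n-1} f[k-1]^T g[k]$$`

The sum matches the `gweight` line in the code: it multiplies the features at
positions 0..n-2 with the gradients at positions 1..n-1 (position 0 is a plain
copy and contributes no gradient to W).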
#### LossLayer

This layer computes the cross-entropy loss and `$log_{10}P(w_{t+1}|w_t)$`
(which can be averaged over all words by users to get the PPL value).

There are two configuration fields to be specified by users,

    message LossProto {
      optional int32 nclass = 1;
      optional int32 vocab_size = 2;
    }

and two weight matrices to be learned,

    class LossLayer : public RNNLayer {
      ...
     private:
      Param* word_weight_, *class_weight_;
    }

The `ComputeFeature` function computes the two probabilities,

`$$P(C_{w_{t+1}}|w_t) = Softmax(w_t * class\_weight_)$$`
`$$P(w_{t+1}|C_{w_{t+1}}) = Softmax(w_t * word\_weight[class\_start:class\_end])$$`

where `$w_t$` is the feature from the hidden layer for the t-th word and
`$w_{t+1}$` is its ground-truth next word. The first equation computes the
probability distribution over all classes for the next word. The second
equation computes the probability distribution over the words in the
ground-truth class of the next word.

The `ComputeGradient` function computes the gradients of the source layer
(i.e., the hidden layer) and of the two weight matrices.

### Updater Configuration

We use the kFixedStep type of learning rate change method, with the
configuration below. We decay the learning rate once the performance stops
improving on the validation dataset.

    updater{
      type: kSGD
      learning_rate {
        type: kFixedStep
        fixedstep_conf:{
          step:0
          step:48810
          step:56945
          step:65080
          step:73215
          step_lr:0.1
          step_lr:0.05
          step_lr:0.025
          step_lr:0.0125
          step_lr:0.00625
        }
      }
    }

### TrainOneBatch() Function

We use the BP (back-propagation) algorithm to train the RNN model here. The
corresponding configuration is shown below.

    # In job.conf file
    train_one_batch {
      alg: kBackPropagation
    }

### Cluster Configuration

The default cluster configuration can be used, i.e., a single worker and a
single server in a single process.

Added: incubator/singa/site/trunk/content/markdown/v0.3.0/zh/train-one-batch.md
URL: http://svn.apache.org/viewvc/incubator/singa/site/trunk/content/markdown/v0.3.0/zh/train-one-batch.md?rev=1740048&view=auto
==============================================================================
--- incubator/singa/site/trunk/content/markdown/v0.3.0/zh/train-one-batch.md (added)
+++ incubator/singa/site/trunk/content/markdown/v0.3.0/zh/train-one-batch.md Wed Apr 20 05:09:06 2016
@@ -0,0 +1,179 @@
# Train-One-Batch

---

For each SGD iteration, every worker calls the `TrainOneBatch` function to
compute gradients of the parameters associated with its local layers (i.e., the
layers dispatched to it). SINGA has implemented two algorithms for the
`TrainOneBatch` function. Users select the corresponding algorithm for their
model in the configuration.

## Basic user guide

### Back-propagation

The [BP algorithm](http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf) is used
for computing gradients of feed-forward models, e.g., [CNN](cnn.html) and
[MLP](mlp.html), and of [RNN](rnn.html) models in SINGA.

    # in job.conf
    alg: kBP

To use the BP algorithm for the `TrainOneBatch` function, users simply
configure the `alg` field with `kBP`. If a neural net contains user-defined
layers, these layers must be implemented properly to be consistent with the
implementation of the BP algorithm in SINGA (see below).

### Contrastive Divergence

The [CD algorithm](http://www.cs.toronto.edu/~fritz/absps/nccd.pdf) is used for
computing gradients of energy models like RBM.

    # job.conf
    alg: kCD
    cd_conf {
      cd_k: 2
    }

To use the CD algorithm for the `TrainOneBatch` function, users just configure
the `alg` field to `kCD`. Users can also configure the number of Gibbs sampling
steps in the CD algorithm through the `cd_k` field. By default, it is set to 1.
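As a reminder of what CD-k estimates (this is the standard formulation, not
SINGA-specific notation): for an RBM with visible units v, hidden units h and a
weight matrix W, the gradient of the log-likelihood is approximated by the
difference between statistics collected in the positive phase and after k Gibbs
steps,

`$$\Delta W \propto \langle v h^T\rangle_{data} - \langle v h^T\rangle_{k}$$`

where `$\langle\cdot\rangle_{k}$` denotes the expectation under the
reconstruction obtained after `cd_k` Gibbs sampling steps. A larger `cd_k`
gives a better approximation at a higher cost per iteration.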
## Advanced user guide

### Implementation of BP

The BP algorithm is implemented in SINGA following the pseudo code below,

    BPTrainOnebatch(step, net) {
      // forward propagate
      foreach layer in net.local_layers() {
        if IsBridgeDstLayer(layer)
          recv data from the src layer (i.e., BridgeSrcLayer)
        foreach param in layer.params()
          Collect(param) // recv response from servers for last update

        layer.ComputeFeature(kForward)

        if IsBridgeSrcLayer(layer)
          send layer.data_ to dst layer
      }
      // backward propagate
      foreach layer in reverse(net.local_layers) {
        if IsBridgeSrcLayer(layer)
          recv gradient from the dst layer (i.e., BridgeDstLayer)
          recv response from servers for last update

        layer.ComputeGradient()
        foreach param in layer.params()
          Update(step, param) // send param.grad_ to servers

        if IsBridgeDstLayer(layer)
          send layer.grad_ to src layer
      }
    }

It forwards features through all local layers (which can be identified by the
layer partition ID and the worker ID) and propagates gradients backwards in the
reverse order.
[BridgeSrcLayer](layer.html#bridgesrclayer--bridgedstlayer)
(resp. `BridgeDstLayer`) is blocked until the feature (resp. gradient) from the
source (resp. destination) layer arrives. Parameter gradients are sent to
servers via the `Update` function. Updated parameters are collected via the
`Collect` function, which blocks until the parameter has been updated.
[Param](param.html) objects have versions, which can be used to check whether a
`Param` object has been updated or not.

Since RNN models are unrolled into feed-forward models, users need to implement
the forward propagation in the recurrent layer's `ComputeFeature` function and
the backward propagation in the recurrent layer's `ComputeGradient` function.
As a result, the whole `TrainOneBatch` runs the
[back-propagation through time (BPTT)](https://en.wikipedia.org/wiki/Backpropagation_through_time)
algorithm.

### Implementation of CD

The CD algorithm is implemented in SINGA following the pseudo code below,

    CDTrainOneBatch(step, net) {
      # positive phase
      foreach layer in net.local_layers()
        if IsBridgeDstLayer(layer)
          recv positive phase data from the src layer (i.e., BridgeSrcLayer)
        foreach param in layer.params()
          Collect(param)  // recv response from servers for last update
        layer.ComputeFeature(kPositive)
        if IsBridgeSrcLayer(layer)
          send positive phase data to dst layer

      # negative phase
      foreach gibbs in [0...layer_proto_.cd_k]
        foreach layer in net.local_layers()
          if IsBridgeDstLayer(layer)
            recv negative phase data from the src layer (i.e., BridgeSrcLayer)
          layer.ComputeFeature(kNegative)
          if IsBridgeSrcLayer(layer)
            send negative phase data to dst layer

      foreach layer in net.local_layers()
        layer.ComputeGradient()
        foreach param in layer.params()
          Update(param)
    }

Parameter gradients are computed after the positive phase and the negative
phase.
### Implementing a new algorithm

SINGA implements BP and CD by creating two subclasses of the
[Worker](../api/classsinga_1_1Worker.html) class: the `TrainOneBatch` function
of [BPWorker](../api/classsinga_1_1BPWorker.html) implements the BP algorithm,
and the `TrainOneBatch` function of
[CDWorker](../api/classsinga_1_1CDWorker.html) implements the CD algorithm. To
implement a new algorithm for the `TrainOneBatch` function, users need to
create a new subclass of `Worker`, e.g.,

    class FooWorker : public Worker {
      void TrainOneBatch(int step, shared_ptr<NeuralNet> net, Metric* perf) override;
      void TestOneBatch(int step, Phase phase, shared_ptr<NeuralNet> net, Metric* perf) override;
    };

`FooWorker` must implement the above two functions for training one mini-batch
and testing one mini-batch respectively. The `perf` argument is for collecting
the training or testing performance, e.g., the objective loss or accuracy. It
is passed to the `ComputeFeature` function of each layer.

Users can define configuration fields for `FooWorker`,

    # in user.proto
    message FooWorkerProto {
      optional int32 b = 1;
    }

    extend JobProto {
      optional FooWorkerProto foo_conf = 101;
    }

    # in job.proto
    JobProto {
      ...
      extensions 101 to max;
    }

This is similar to
[adding configuration fields for a new layer](layer.html#implementing-a-new-layer-subclass).

To use `FooWorker`, users need to register it in the
[main.cc](programming-guide.html) and configure the `alg` and `foo_conf`
fields,

    # in main.cc
    const int kFoo = 3; // worker ID, must be different from those of CDWorker and BPWorker
    driver.RegisterWorker<FooWorker>(kFoo);

    # in job.conf
    ...
    alg: 3
    [foo_conf] {
      b = 4;
    }

Added: incubator/singa/site/trunk/content/markdown/v0.3.0/zh/updater.md
URL: http://svn.apache.org/viewvc/incubator/singa/site/trunk/content/markdown/v0.3.0/zh/updater.md?rev=1740048&view=auto
==============================================================================
--- incubator/singa/site/trunk/content/markdown/v0.3.0/zh/updater.md (added)
+++ incubator/singa/site/trunk/content/markdown/v0.3.0/zh/updater.md Wed Apr 20 05:09:06 2016
@@ -0,0 +1,284 @@
# Updater

---

Every server in SINGA has an [Updater](../api/classsinga_1_1Updater.html)
instance that updates parameters based on gradients. In this page, the *Basic
user guide* describes the configuration of an updater. The *Advanced user
guide* presents details on how to implement a new updater and a new learning
rate changing method.

## Basic user guide

There are many different parameter updating protocols (i.e., subclasses of
`Updater`). They share some configuration fields like

* `type`, an integer for identifying an updater;
* `learning_rate`, configuration for the
[LRGenerator](../api/classsinga_1_1LRGenerator.html) which controls the learning rate;
* `weight_decay`, the coefficient for
[L2 regularization](http://deeplearning.net/tutorial/gettingstarted.html#regularization);
* [momentum](http://ufldl.stanford.edu/tutorial/supervised/OptimizationStochasticGradientDescent/).

If you are not familiar with the above terms, you can find their meanings in
[this page provided by Karpathy](http://cs231n.github.io/neural-networks-3/#update).

### Configuration of built-in updater classes

#### Updater

The base `Updater` implements the
[vanilla SGD algorithm](http://cs231n.github.io/neural-networks-3/#sgd).
Its configuration type is `kSGD`. Users need to configure at least the
`learning_rate` field. `momentum` and `weight_decay` are optional fields.

    updater{
      type: kSGD
      momentum: float
      weight_decay: float
      learning_rate {
        ...
      }
    }
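For orientation, the update that the `kSGD` configuration above corresponds to
can be written as follows. This is the standard momentum/weight-decay
formulation rather than a quote of SINGA's exact code; `$\theta$` is a
parameter, `$g$` its gradient, `$v$` its momentum buffer, `$\mu$` the
`momentum`, `$\lambda$` the `weight_decay` and `$\eta_{step}$` the learning
rate produced by the configured `learning_rate` generator.

`$$v \leftarrow \mu v - \eta_{step}(g + \lambda\theta), \qquad \theta \leftarrow \theta + v$$`

With `momentum` and `weight_decay` omitted, this reduces to
`$\theta \leftarrow \theta - \eta_{step} g$`.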
#### AdaGradUpdater

It inherits the base `Updater` to implement the
[AdaGrad](http://www.magicbroom.info/Papers/DuchiHaSi10.pdf) algorithm. Its
type is `kAdaGrad`. `AdaGradUpdater` is configured similarly to `Updater`
except that `momentum` is not used.

#### NesterovUpdater

It inherits the base `Updater` to implement the
[Nesterov](http://arxiv.org/pdf/1212.0901v2.pdf) (section 3.5) updating
protocol. Its type is `kNesterov`. `learning_rate` and `momentum` must be
configured. `weight_decay` is an optional configuration field.

#### RMSPropUpdater

It inherits the base `Updater` to implement the
[RMSProp algorithm](http://cs231n.github.io/neural-networks-3/#sgd) proposed by
[Hinton](http://www.cs.toronto.edu/%7Etijmen/csc321/slides/lecture_slides_lec6.pdf)
(slide 29). Its type is `kRMSProp`.

    updater {
      type: kRMSProp
      rmsprop_conf {
        rho: float # [0,1]
      }
    }

### Configuration of learning rate

The `learning_rate` field is configured as,

    learning_rate {
      type: ChangeMethod
      base_lr: float  # base/initial learning rate
      ... # fields of a specific changing method
    }

The common fields include `type` and `base_lr`. SINGA provides the following
`ChangeMethod`s.

#### kFixed

The `base_lr` is used for all steps.

#### kLinear

The updater should be configured like

    learning_rate {
      base_lr: float
      linear_conf {
        freq: int
        final_lr: float
      }
    }

Linear interpolation is used to change the learning rate,

    lr = (1 - step / freq) * base_lr + (step / freq) * final_lr

#### kExponential

The updater should be configured like

    learning_rate {
      base_lr: float
      exponential_conf {
        freq: int
      }
    }

The learning rate for `step` is

    lr = base_lr / 2^(step / freq)

#### kInverseT

The updater should be configured like

    learning_rate {
      base_lr: float
      inverset_conf {
        final_lr: float
      }
    }

The learning rate for `step` is

    lr = base_lr / (1 + step / final_lr)

#### kInverse

The updater should be configured like

    learning_rate {
      base_lr: float
      inverse_conf {
        gamma: float
        pow: float
      }
    }

The learning rate for `step` is

    lr = base_lr * (1 + gamma * step)^(-pow)

#### kStep

The updater should be configured like

    learning_rate {
      base_lr : float
      step_conf {
        change_freq: int
        gamma: float
      }
    }

The learning rate for `step` is

    lr = base_lr * gamma^(step / change_freq)

#### kFixedStep

The updater should be configured like

    learning_rate {
      fixedstep_conf {
        step: int
        step_lr: float

        step: int
        step_lr: float

        ...
      }
    }

Denote the i-th tuple as (step[i], step_lr[i]); then the learning rate for
`step` is

    step_lr[k]

where step[k] is the largest configured step that is not greater than `step`.
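To make the two rules that are easiest to get wrong concrete, here is a small
stand-alone sketch of `kStep` and `kFixedStep` as plain functions. The function
names and the vector-based interface are illustrative only and are not part of
the SINGA API.

    #include <cmath>
    #include <vector>

    // kStep: lr = base_lr * gamma^(step / change_freq), with integer division.
    float StepLR(float base_lr, float gamma, int change_freq, int step) {
      return base_lr * std::pow(gamma, step / change_freq);
    }

    // kFixedStep: return step_lr[k] where steps[k] is the largest configured
    // step that is not greater than `step` (steps must be sorted ascending).
    float FixedStepLR(const std::vector<int>& steps,
                      const std::vector<float>& step_lr, int step) {
      float lr = step_lr.front();
      for (size_t k = 0; k < steps.size(); ++k)
        if (steps[k] <= step) lr = step_lr[k];
      return lr;
    }

With the kFixedStep configuration from the [RNN example](rnn.html), this
returns 0.1 for steps below 48810, 0.05 up to 56944, and so on.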
## Advanced user guide

### Implementing a new Updater subclass

The base `Updater` class has one virtual function,

    class Updater{
     public:
      virtual void Update(int step, Param* param, float grad_scale = 1.0f) = 0;

     protected:
      UpdaterProto proto_;
      LRGenerator lr_gen_;
    };

It updates the values of the `param` based on its gradients. The `step`
argument is for deciding the learning rate, which may change over time (step).
`grad_scale` scales the original gradient values. This function is called by a
server once it has received all gradients for the same `Param` object.

To implement a new Updater subclass, users must override the `Update` function,

    class FooUpdater : public Updater {
      void Update(int step, Param* param, float grad_scale = 1.0f) override;
    };

Configuration of this new updater can be declared similarly to that of a new
layer,

    # in user.proto
    message FooUpdaterProto {
      optional int32 c = 1;
    }

    extend UpdaterProto {
      optional FooUpdaterProto fooupdater_conf = 101;
    }

The new updater should be registered in the
[main function](programming-guide.html),

    driver.RegisterUpdater<FooUpdater>("FooUpdater");

Users can then configure the job as

    # in job.conf
    updater {
      user_type: "FooUpdater" # must use user_type with the same string identifier as the one used for registration
      fooupdater_conf {
        c : 20;
      }
    }

### Implementing a new LRGenerator subclass

The base `LRGenerator` class declares one virtual function,

    virtual float Get(int step);

To implement a subclass, e.g., `FooLRGen`, users should declare it like

    class FooLRGen : public LRGenerator {
     public:
      float Get(int step) override;
    };

Configuration of `FooLRGen` can be defined using a protocol message,

    # in user.proto
    message FooLRProto {
      ...
    }

    extend LRGenProto {
      optional FooLRProto foolr_conf = 101;
    }

The configuration is then like,

    learning_rate {
      user_type : "FooLR" # must use user_type with the same string identifier as the one used for registration
      base_lr: float
      foolr_conf {
        ...
      }
    }

Users have to register this subclass in the main function,

    driver.RegisterLRGenerator<FooLRGen, std::string>("FooLR");
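To round this off, a sketch of what the body of `FooLRGen::Get` might look
like, implementing a simple inverse-decay schedule. The constants and the
`gamma` name are hypothetical; real code would read `base_lr` and the
`foolr_conf` fields from the extended `LRGenProto` configuration instead of
hard-coding them.

    // Illustrative only: inverse decay, lr = base_lr / (1 + gamma * step).
    float FooLRGen::Get(int step) {
      const float base_lr = 0.1f;   // assumed configured value
      const float gamma = 1e-4f;    // assumed configured value
      return base_lr / (1.0f + gamma * step);
    }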
