[ 
https://issues.apache.org/jira/browse/MAHOUT-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060119#comment-14060119
 ] 

Felix Schüler commented on MAHOUT-1388:
---------------------------------------

We played around with the existing implementation and came across some issues 
that might need clarification or fixes. We could try to fix some of them 
ourselves but would be glad to get some feedback first, especially given that 
the MLP resides in the mrlegacy package and might no longer be used once an 
implementation in the Spark DSL exists.

- First of all, the MLP CLI does not seem to perform more than one pass over 
the training data. This is especially problematic for a small dataset such as 
the iris dataset. The corresponding unit test performs 2000 iterations over 
the input data, whereas the command-line version feeds the input through the 
network only once. As a result, the model is undertrained and produces wrong 
output on the validation data. We think this should be solved either with an 
iteration parameter or with the option to define a train/validation split and 
use early stopping, where training stops once no significant improvement on 
the validation set is observed.

- In RunMultilayerPerceptron, the parameter -cr (column range) cannot be set. 
Data to be classified usually has no labels, but we think it should still be 
possible to select the columns of an input file for validation. In 
particular, if we split one dataset into training and validation parts, we 
don't want to remove all the labels by hand. The fix is fairly easy, since 
the functionality is already implemented and just has to be added to the 
argument parser (we will provide a patch for this).

- We are not sure whether the CLI MLP can be used for regression, since all 
the labels have to be provided as arguments.

- Small typo: momentumweight is spelled "momemtumweight". We can provide a 
patch for this as well.
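
To make the early-stopping suggestion from the first point more concrete, here 
is a minimal sketch in Python. It is not Mahout code: train_one_pass, 
validation_error, patience and min_improvement are hypothetical names chosen 
for illustration, and the default of 2000 passes simply mirrors the number 
used in the unit test. The idea is to repeat full passes over the training 
split and stop once the error on the validation split has not improved by a 
meaningful amount for a few passes in a row.

```python
from typing import Callable, Tuple


def train_with_early_stopping(
    train_one_pass: Callable[[], None],
    validation_error: Callable[[], float],
    max_iterations: int = 2000,
    patience: int = 10,
    min_improvement: float = 1e-4,
) -> Tuple[int, float]:
    """Repeat training passes until the validation error stops improving.

    All parameter names are assumptions made for this sketch.
    Returns (number of passes performed, best validation error seen).
    """
    best_error = float("inf")
    passes_without_improvement = 0
    passes_done = 0

    for _ in range(max_iterations):
        train_one_pass()            # one full pass over the training split
        passes_done += 1
        error = validation_error()  # error on the held-out validation split

        if best_error - error > min_improvement:
            # Significant improvement: remember it and reset the counter.
            best_error = error
            passes_without_improvement = 0
        else:
            passes_without_improvement += 1
            if passes_without_improvement >= patience:
                break               # no significant improvement for a while

    return passes_done, best_error
```

A plain iteration parameter would correspond to max_iterations alone (for 
example with patience set to max_iterations), so both proposals could share 
the same loop.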



> Add command line support and logging for MLP
> --------------------------------------------
>
>                 Key: MAHOUT-1388
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1388
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 1.0
>            Reporter: Yexi Jiang
>            Assignee: Suneel Marthi
>              Labels: mlp, sgd
>             Fix For: 1.0
>
>         Attachments: Mahout-1388.patch, Mahout-1388.patch
>
>
> The user should have the ability to run the Perceptron from the command line.
> There are two programs to execute MLP, the training and labeling. The first 
> one takes the data as input and outputs the model, the second one takes the 
> model and unlabeled data as input and outputs the results.
> The parameters for training are as follows:
> ------------------------------------------------
> --input -i (input data)
> --skipHeader -sk // whether to skip the first row, this parameter is optional
> --labels -labels // the labels of the instances, separated by whitespace. 
> Take the iris dataset for example, the labels are 'setosa versicolor 
> virginica'.
> --model -mo  // in training mode, this is the location to store the model (if 
> the specified location has an existing model, it will update the model 
> through incremental learning), in labeling mode, this is the location to 
> store the result
> --update -u // whether to incrementally update the model; if this parameter 
> is not given, the model is trained from scratch
> --output -o           // this is only useful in labeling mode
> --layersize -ls (no. of units per hidden layer) // use whitespace separated 
> number to indicate the number of neurons in each layer (including input layer 
> and output layer), e.g. '5 3 2'.
> --squashingFunction -sf // currently only supports Sigmoid
> --momentum -m 
> --learningrate -l
> --regularizationweight -r
> --costfunction -cf   // the type of cost function,
> ------------------------------------------------
> For example, to train a 3-layer (including input, hidden, and output) MLP 
> with a learning rate of 0.1, a momentum of 0.1, and a regularization weight 
> of 0.01, the parameters would be:
> mlp -i /tmp/training-data.csv -labels setosa versicolor virginica -o 
> /tmp/model.model -ls 5,3,1 -l 0.1 -m 0.1 -r 0.01
> This command would read the training data from /tmp/training-data.csv and 
> write the trained model to /tmp/model.model.
> The parameters for labeling are as follows:
> -------------------------------------------------------------
> --input -i // input file path
> --columnRange -cr // the range of column used for feature, start from 0 and 
> separated by whitespace, e.g. 0 5
> --format -f // the format of input file, currently only supports csv
> --model -mo // the file path of the model
> --output -o // the output path for the results
> -------------------------------------------------------------
> If a user needs to use an existing model, they would use the following command:
> mlp -i /tmp/unlabel-data.csv -m /tmp/model.model -o /tmp/label-result
> Moreover, we should be providing default values if the user does not specify 
> any. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
