GitHub user JeremyNixon opened a pull request:
https://github.com/apache/spark/pull/13617
[SPARK-10409] [ML] Add Multilayer Perceptron Regression to ML
## What changes were proposed in this pull request?
This is a pull request adding support for Multilayer Perceptron Regression,
the counterpart to the Multilayer Perceptron Classifier (hereafter MLPR and
MLPC).
#### Outline
1. Major Changes
2. API Decisions
3. Automating Scaling
4. Naming and Features
5. Reference Resources
6. Testing
## Major Changes
There are two major differences between MLPR and MLPC. The first is the use
of a linear (identity) activation function and a sum-of-squared-error cost
function in the last layer of the network. The second is the requirement to
scale the data to [0,1] and back, so that the weights can fit values in the
proper range.
#### Linear Activation
In the forward pass the linear activation passes the value from the fully
connected layer through unchanged to become the network prediction. During the
backward pass its derivative is one, so weight adjustment reduces to propagating
the prediction error. All regression models will use the linear activation in
the last layer, so there is no option (as there is in MLPC) to choose another
activation function or cost function for the last layer.
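As a rough sketch of what this means for the output layer (the object and method names below are illustrative, not the internal ann API):

```scala
// Illustrative sketch of a linear output layer paired with a sum-of-squared-error
// cost; these names do not correspond to the PR's internal ann.Layer classes.
object LinearOutput {
  // Forward pass: the identity activation passes values through unchanged,
  // so the fully connected layer's output is the network prediction.
  def activate(z: Array[Double]): Array[Double] = z

  // Backward pass: the activation's derivative is 1, so with a squared-error
  // cost the output delta is simply (prediction - target).
  def delta(prediction: Array[Double], target: Array[Double]): Array[Double] =
    prediction.zip(target).map { case (p, t) => p - t }
}
```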
#### Automated Scaling
The data scaling is done through min-max scaling: the minimum label is
subtracted from every value (giving a range of [0, max - min]), and each value
is then divided by max - min to map it into [0, 1]. The corner case where
max - min = 0 is resolved by omitting the division step.
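A minimal sketch of that label scaling, assuming plain arrays of labels rather than the PR's actual implementation:

```scala
// Minimal sketch of the min-max label scaling described above (not the PR's code).
def scaleLabels(labels: Array[Double]): (Array[Double], Double, Double) = {
  val min = labels.min
  val max = labels.max
  val scaled =
    if (max == min) labels.map(_ - min)            // corner case: skip the division
    else labels.map(l => (l - min) / (max - min))  // map labels into [0, 1]
  (scaled, min, max)
}

// Predictions are mapped back to the original label range after the forward pass.
def unscale(prediction: Double, min: Double, max: Double): Double =
  if (max == min) prediction + min else prediction * (max - min) + min
```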
#### Motivating Examples
<img width="870" alt="screen shot 2016-06-10 at 11 26 18 pm"
src="https://cloud.githubusercontent.com/assets/4738024/15983614/d1ef0fe2-2f62-11e6-89f1-dc0c0dd6be94.png">

## API Decisions
The API is identical to MLPC with the exception of softmaxOnTop: there is no
option for the last-layer activation function or for the cost function
(MLPC gives a choice between cross entropy and sum of squared errors). The user
calls MLPR with an array of layer sizes that describes the topology of the
network. The number of hidden layers is inferred from the layers parameter and
is equal to the total number of layers minus 2. Each hidden layer is a
feedforward layer with a sigmoid activation function, up to the output layer
with its linear activation.
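A hypothetical usage example mirroring MultilayerPerceptronClassifier; the class name, package, and setters are assumptions based on the description above, and trainingData / testData stand in for user-provided DataFrames:

```scala
// Hypothetical API usage; names are assumed to mirror the existing MLPC API.
import org.apache.spark.ml.regression.MultilayerPerceptronRegressor

// 10 input features, hidden layers of 5 and 4 sigmoid units, one linear output.
val layers = Array[Int](10, 5, 4, 1)

val mlpr = new MultilayerPerceptronRegressor()
  .setLayers(layers)
  .setMaxIter(100)
  .setSeed(1234L)

val model = mlpr.fit(trainingData)
val predictions = model.transform(testData)
```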
#### Input/Output Layer Argument
For MLPR, the output count will always be 1, and the number of inputs will
always be equal to the number of features in the training dataset. One API
choice could be to have the user supply only the number of neurons in the
hidden layers, and infer the input and output counts from the training data.
At the very least, it makes sense to validate the user's layers parameter and
display a helpful error message, instead of the data stacker error that
currently appears when an improper number of inputs or outputs is provided.
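One possible shape for that validation, run before training; the helper name and error messages are illustrative only:

```scala
// Illustrative validation of the user's layers parameter for regression.
def validateLayers(layers: Array[Int], numFeatures: Int): Unit = {
  require(layers != null && layers.length >= 2,
    "layers must contain at least an input size and an output size")
  require(layers.head == numFeatures,
    s"Input layer size ${layers.head} does not match the number of features $numFeatures")
  require(layers.last == 1,
    s"Output layer size must be 1 for regression, but got ${layers.last}")
}
```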
#### Modular API
It would also make sense for the API to be modular. A user will want the
flexibility to use the linear layer at different points in the network (in MLPC
as well), and will certainly want to be able to use new activation functions
(tanh, ReLU) as they are added to improve the performance of these models. That
flexibility allows a user to tune the network to their dataset and will be
particularly important for convolutional or recurrent networks in the future.
## Automating Scaling
The current behavior is to automatically scale the data for the user, which
requires an extra pass over the data. There are a few options. We could:
1. Always scale the data internally.
2. Scale the data internally unless the user provides the min/max themselves.
3. Add an argument that turns internal scaling on/off, default it to one or
the other, and warn the user when running on unscaled data.
There are also all the variants between autoscaling or not, adding an argument
or not, and warning the user or not.
The algorithm will run quite poorly on unscaled data, so it makes sense to
safeguard the user from this. But the same is true of data that is not centered
and scaled, and we don't provide that automatically (though it may not be a bad
idea as an option, given how sensitive this non-convex function (whenever there
are hidden layers) can be to unscaled data). So there's a question of how much
we hold the user's hand. I advocate for helpful defaults that can be
overridden: scale automatically, provide an option to run without scaling, and
skip autoscaling when both the min and max are provided by the user.
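A sketch of that advocated default, where autoscaling is skipped only when the user supplies both bounds; the parameter and method names are hypothetical, not the PR's API:

```scala
// Hypothetical helper: use user-supplied bounds if both are given,
// otherwise compute them with an extra pass over the labels.
def resolveLabelBounds(labels: Array[Double],
                       userMin: Option[Double],
                       userMax: Option[Double]): (Double, Double) =
  (userMin, userMax) match {
    case (Some(lo), Some(hi)) => (lo, hi)                 // user-supplied: skip the data pass
    case _                    => (labels.min, labels.max) // otherwise compute bounds automatically
  }
```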
## Naming
Lastly, there is the naming of the multiLayerPerceptron /
multilayerPerceptronRegression function in the FeedForwardTopology class in
Layer.scala. For consistency it may make sense to rename multiLayerPerceptron
to multiLayerPerceptronClassifier.
## Features
There are a few features that have been checked:
1. Integrates cleanly with the Pipeline API (see the sketch after this list)
2. Model save/load is enabled
3. The example data is the popular Boston housing (LoadBoston) dataset, scaled
4. Example code is included
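A hedged sketch of plugging the regressor into a Pipeline; the regressor's class name is an assumption, and trainingData stands in for a user-provided DataFrame with "rawFeatures" and "label" columns:

```scala
// Sketch of Pipeline integration, assuming the regressor exposes the standard
// Estimator interface like other ml components.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.MinMaxScaler

val scaler = new MinMaxScaler()
  .setInputCol("rawFeatures")
  .setOutputCol("features")

val mlpr = new MultilayerPerceptronRegressor()
  .setLayers(Array(13, 8, 1))   // 13 features (Boston housing), one hidden layer, 1 output

val pipeline = new Pipeline().setStages(Array(scaler, mlpr))
val pipelineModel = pipeline.fit(trainingData)   // fits scaler and network in one call
```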
## Reference Resources
Christopher M. Bishop. Neural Networks for Pattern Recognition.
Patrick Nicolas. Scala for Machine Learning, Chapter 9.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning, Chapter 6.
## How was this patch tested?
The unit tests follow MLPC, with the addition of a test for gradient
descent (an illustrative sketch follows the list). There are unit tests for:
1. L-BFGS behavior on toy data
2. Gradient descent on toy data
3. Input validation
4. Setting the weights parameter
5. Save/load functionality
6. Read/write returns a model with similar layers and weights
7. Support for all numeric types
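An illustrative sketch of one such toy-data test, loosely following the style of MultilayerPerceptronClassifierSuite; the regressor's class name and the surrounding test harness (a Spark test suite with SparkSession implicits in scope) are assumptions:

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.Row

// Sketch only: assumes a FunSuite-style Spark test class with toDF available.
test("MLPR fits the identity function on toy data with L-BFGS") {
  // Five points where the label equals the single feature.
  val df = Seq(0.0, 0.25, 0.5, 0.75, 1.0)
    .map(x => (x, Vectors.dense(x)))
    .toDF("label", "features")

  val model = new MultilayerPerceptronRegressor()
    .setLayers(Array(1, 5, 1))   // 1 input, one hidden layer of 5, 1 linear output
    .setMaxIter(200)
    .setSeed(11L)
    .fit(df)

  // Predictions should land close to the original labels after unscaling.
  model.transform(df).select("prediction", "label").collect().foreach {
    case Row(p: Double, l: Double) => assert(math.abs(p - l) < 0.1)
  }
}
```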
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/JeremyNixon/spark dnnr
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/13617.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #13617
----
commit 979505c27b5d8c95cf2050cd0863137749ccc0fe
Author: JeremyNixon <[email protected]>
Date: 2016-05-03T15:53:17Z
working version of mlpr
commit abfe50d7c4c82a75411d5601a2a80f81257742fa
Author: JeremyNixon <[email protected]>
Date: 2016-05-03T21:30:35Z
refactor, enable mlpc to run simultaneously, remove commented code
commit c73eb7bdc8e5ee959dcfa7a55904458634037091
Author: JeremyNixon <[email protected]>
Date: 2016-05-21T01:12:32Z
update with ml Vector
commit 583febcb5f4b9ff87e70c2b973c1b1ff8a889654
Author: JeremyNixon <[email protected]>
Date: 2016-06-06T13:27:29Z
working with gd, updated with save-load
commit 080bedb7b3a718204b1368b45f9d4b191e5aeb22
Author: JeremyNixon <[email protected]>
Date: 2016-06-06T17:40:47Z
add additional test for gradient descent
commit 982d08cfaff79b2701edf416e34a643a229407e8
Author: JeremyNixon <[email protected]>
Date: 2016-06-07T21:58:18Z
add validation for min = max, update tests
commit 85b47269fa9a6efa49bc742ef87e2cd3dc46dbd4
Author: JeremyNixon <[email protected]>
Date: 2016-06-10T17:06:24Z
top to bottom review of each file, add example code
commit b5f90e5e5a42450ce9b7d067878a0a1eda68e414
Author: JeremyNixon <[email protected]>
Date: 2016-06-10T19:04:34Z
update testing suite
commit 8a3f984cab0a5ba14dc0f3008616a95d32280f68
Author: JeremyNixon <[email protected]>
Date: 2016-06-10T21:58:28Z
efficiently autocompute min and max
commit 46783acdb5de62530f1cfdc9c69a54f969d42d7e
Author: JeremyNixon <[email protected]>
Date: 2016-06-11T06:52:58Z
Clean up loose new lines. Make comments more readable.
----