GitHub user JeremyNixon opened a pull request:
https://github.com/apache/spark/pull/13617
[SPARK-10409] [ML] Add Multilayer Perceptron Regression to ML
## What changes were proposed in this pull request?
This is a pull request adding support for Multilayer Perceptron Regression,
the counterpart to the Multilayer Perceptron Classifier (hereafter MLPR and
MLPC).
#### Outline
1. Major Changes
2. API Decisions
3. Automating Scaling
4. Naming and Features
5. Reference Resources
6. Testing
## Major Changes
There are two major differences between MLPR and MLPC. The first is the use
of a linear (identity) activation function and a sum-of-squared-error cost
function in the last layer of the network. The second is the requirement to
scale the data to [0,1] and back, so that the weights can fit values in the
proper range.
#### Linear Activation
In the forward pass the linear activation passes the value from the fully
connected layer through unchanged to become the network prediction. During the
backward pass its derivative is one, so weight adjustment reduces to propagating
the prediction error. All regression models will use the linear activation in
the last layer, so there is no option (as there is in MLPC) to choose another
activation function or cost function for the last layer.
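As a rough sketch of what this means for the output layer (the object and method names below are illustrative, not the internal ann API):

```scala
// Illustrative sketch of a linear output layer paired with a sum-of-squared-error
// cost; these names do not correspond to the PR's internal ann.Layer classes.
object LinearOutput {
  // Forward pass: the identity activation passes values through unchanged,
  // so the fully connected layer's output is the network prediction.
  def activate(z: Array[Double]): Array[Double] = z

  // Backward pass: the activation's derivative is 1, so with a squared-error
  // cost the output delta is simply (prediction - target).
  def delta(prediction: Array[Double], target: Array[Double]): Array[Double] =
    prediction.zip(target).map { case (p, t) => p - t }
}
```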
#### Automated Scaling
The data scaling is done through min-max scaling: the minimum label is
subtracted from every value (giving a range of [0, max - min]), and each value
is then divided by max - min to map it into [0, 1]. The corner case where
max - min = 0 is resolved by omitting the division step.
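A minimal sketch of that label scaling, assuming plain arrays of labels rather than the PR's actual implementation:

```scala
// Minimal sketch of the min-max label scaling described above (not the PR's code).
def scaleLabels(labels: Array[Double]): (Array[Double], Double, Double) = {
  val min = labels.min
  val max = labels.max
  val scaled =
    if (max == min) labels.map(_ - min)            // corner case: skip the division
    else labels.map(l => (l - min) / (max - min))  // map labels into [0, 1]
  (scaled, min, max)
}

// Predictions are mapped back to the original label range after the forward pass.
def unscale(prediction: Double, min: Double, max: Double): Double =
  if (max == min) prediction + min else prediction * (max - min) + min
```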
#### Motivating Examples
<img width="870" alt="screen shot 2016-06-10 at 11 26 18 pm"
src="https://cloud.githubusercontent.com/assets/4738024/15983614/d1ef0fe2-2f62-11e6-89f1-dc0c0dd6be94.png">

## API Decisions
The API is identical to MLPC with the exception of softmaxOnTop: there is no
option for the last-layer activation function or for the cost function
(MLPC gives a choice between cross entropy and sum of squared errors). The user
calls MLPR with an array of layer sizes that describes the topology of the
network. The number of hidden layers is inferred from the layers parameter and
is equal to the total number of layers minus 2. Each hidden layer is a
feedforward layer with a sigmoid activation function, up to the output layer
with its linear activation.
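A hypothetical usage example mirroring MultilayerPerceptronClassifier; the class name, package, and setters are assumptions based on the description above, and trainingData / testData stand in for user-provided DataFrames:

```scala
// Hypothetical API usage; names are assumed to mirror the existing MLPC API.
import org.apache.spark.ml.regression.MultilayerPerceptronRegressor

// 10 input features, hidden layers of 5 and 4 sigmoid units, one linear output.
val layers = Array[Int](10, 5, 4, 1)

val mlpr = new MultilayerPerceptronRegressor()
  .setLayers(layers)
  .setMaxIter(100)
  .setSeed(1234L)

val model = mlpr.fit(trainingData)
val predictions = model.transform(testData)
```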
#### Input/Output Layer Argument
For MLPR, the output count will always be 1, and the number of inputs will
always be equal to the number of features in the training dataset. One API
choice could be to have the user supply only the number of neurons in the
hidden layers, and infer the input and output counts from the training data.
At the very least, it makes sense to validate the user's layers parameter and
display a helpful error message, instead of the data stacker error that
currently appears when an improper number of inputs or outputs is provided.
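One possible shape for that validation, run before training; the helper name and error messages are illustrative only:

```scala
// Illustrative validation of the user's layers parameter for regression.
def validateLayers(layers: Array[Int], numFeatures: Int): Unit = {
  require(layers != null && layers.length >= 2,
    "layers must contain at least an input size and an output size")
  require(layers.head == numFeatures,
    s"Input layer size ${layers.head} does not match the number of features $numFeatures")
  require(layers.last == 1,
    s"Output layer size must be 1 for regression, but got ${layers.last}")
}
```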
#### Modular API
It would also make sense for the API to be modular. A user will want the
flexibility to use the linear layer at different points in the network (in MLPC
as well), and will certainly want to be able to use new activation functions
(tanh, ReLU) as they are added to improve the performance of these models. That
flexibility allows a user to tune the network to their dataset and will be
particularly important for convolutional or recurrent networks in the future.
## Automating Scaling
The current behavior is to automatically scale the data for the user, which
requires an extra pass over the data. There are a few options. We could:
1. Always scale the data internally.
2. Scale the data internally unless the user provides the min/max themselves.
3. Add an argument that turns internal scaling on/off, default it to one or
the other, and warn the user when running on unscaled data.
There are also all the variants between autoscaling or not, adding an argument
or not, and warning the user or not.
The algorithm will run quite poorly on unscaled data, so it makes sense to
safeguard the user from this. But the same is true of data that is not centered
and scaled, and we don't provide that automatically (though it may not be a bad
idea as an option, given how sensitive this non-convex function (whenever there
are hidden layers) can be to unscaled data). So there's a question of how much
we hold the user's hand. I advocate for helpful defaults that can be
overridden: scale automatically, provide an option to run without scaling, and
skip autoscaling when both the min and max are provided by the user.
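A sketch of that advocated default, where autoscaling is skipped only when the user supplies both bounds; the parameter and method names are hypothetical, not the PR's API:

```scala
// Hypothetical helper: use user-supplied bounds if both are given,
// otherwise compute them with an extra pass over the labels.
def resolveLabelBounds(labels: Array[Double],
                       userMin: Option[Double],
                       userMax: Option[Double]): (Double, Double) =
  (userMin, userMax) match {
    case (Some(lo), Some(hi)) => (lo, hi)                 // user-supplied: skip the data pass
    case _                    => (labels.min, labels.max) // otherwise compute bounds automatically
  }
```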
## Naming
Lastly, there is the naming of the multiLayerPerceptron /
multilayerPerceptronRegression function in the FeedForwardTopology class in
Layer.scala. For consistency it may make sense to rename multiLayerPerceptron
to multiLayerPerceptronClassifier.
## Features
There are a few features that have been checked:
1. Integrates cleanly with the Pipeline API (see the sketch after this list)
2. Model save/load is enabled
3. The example data is the popular Boston housing (LoadBoston) dataset, scaled
4. Example code is included
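A hedged sketch of plugging the regressor into a Pipeline; the regressor's class name is an assumption, and trainingData stands in for a user-provided DataFrame with "rawFeatures" and "label" columns:

```scala
// Sketch of Pipeline integration, assuming the regressor exposes the standard
// Estimator interface like other ml components.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.MinMaxScaler

val scaler = new MinMaxScaler()
  .setInputCol("rawFeatures")
  .setOutputCol("features")

val mlpr = new MultilayerPerceptronRegressor()
  .setLayers(Array(13, 8, 1))   // 13 features (Boston housing), one hidden layer, 1 output

val pipeline = new Pipeline().setStages(Array(scaler, mlpr))
val pipelineModel = pipeline.fit(trainingData)   // fits scaler and network in one call
```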
## Reference Resources
Christopher M. Bishop. Neural Networks for Pattern Recognition.
Patrick Nicolas. Scala for Machine Learning, Chapter 9.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning, Chapter 6.
## How was this patch tested?
The unit tests follow MLPC, with the addition of a test for gradient
descent (an illustrative sketch follows the list). There are unit tests for:
1. L-BFGS behavior on toy data
2. Gradient descent on toy data
3. Input validation
4. Setting the weights parameter
5. Save/load functionality
6. Read/write returns a model with similar layers and weights
7. Support for all numeric types
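An illustrative sketch of one such toy-data test, loosely following the style of MultilayerPerceptronClassifierSuite; the regressor's class name and the surrounding test harness (a Spark test suite with SparkSession implicits in scope) are assumptions:

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.Row

// Sketch only: assumes a FunSuite-style Spark test class with toDF available.
test("MLPR fits the identity function on toy data with L-BFGS") {
  // Five points where the label equals the single feature.
  val df = Seq(0.0, 0.25, 0.5, 0.75, 1.0)
    .map(x => (x, Vectors.dense(x)))
    .toDF("label", "features")

  val model = new MultilayerPerceptronRegressor()
    .setLayers(Array(1, 5, 1))   // 1 input, one hidden layer of 5, 1 linear output
    .setMaxIter(200)
    .setSeed(11L)
    .fit(df)

  // Predictions should land close to the original labels after unscaling.
  model.transform(df).select("prediction", "label").collect().foreach {
    case Row(p: Double, l: Double) => assert(math.abs(p - l) < 0.1)
  }
}
```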
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/JeremyNixon/spark dnnr
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/13617.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #13617
----
commit 979505c27b5d8c95cf2050cd0863137749ccc0fe
Author: JeremyNixon <[email protected]>
Date: 2016-05-03T15:53:17Z
working version of mlpr
commit abfe50d7c4c82a75411d5601a2a80f81257742fa
Author: JeremyNixon <[email protected]>
Date: 2016-05-03T21:30:35Z
refactor, enable mlpc to run simultaneously, remove commented code
commit c73eb7bdc8e5ee959dcfa7a55904458634037091
Author: JeremyNixon <[email protected]>
Date: 2016-05-21T01:12:32Z
update with ml Vector
commit 583febcb5f4b9ff87e70c2b973c1b1ff8a889654
Author: JeremyNixon <[email protected]>
Date: 2016-06-06T13:27:29Z
working with gd, updated with save-load
commit 080bedb7b3a718204b1368b45f9d4b191e5aeb22
Author: JeremyNixon <[email protected]>
Date: 2016-06-06T17:40:47Z
add additional test for gradient descent
commit 982d08cfaff79b2701edf416e34a643a229407e8
Author: JeremyNixon <[email protected]>
Date: 2016-06-07T21:58:18Z
add validation for min = max, update tests
commit 85b47269fa9a6efa49bc742ef87e2cd3dc46dbd4
Author: JeremyNixon <[email protected]>
Date: 2016-06-10T17:06:24Z
top to bottom review of each file, add example code
commit b5f90e5e5a42450ce9b7d067878a0a1eda68e414
Author: JeremyNixon <[email protected]>
Date: 2016-06-10T19:04:34Z
update testing suite
commit 8a3f984cab0a5ba14dc0f3008616a95d32280f68
Author: JeremyNixon <[email protected]>
Date: 2016-06-10T21:58:28Z
efficiently autocompute min and max
commit 46783acdb5de62530f1cfdc9c69a54f969d42d7e
Author: JeremyNixon <[email protected]>
Date: 2016-06-11T06:52:58Z
Clean up loose new lines. Make comments more readable.
----