Github user JeremyNixon commented on the issue:

    https://github.com/apache/spark/pull/13617
  
    @jkbradley, of course you’re welcome for the PR! I’d be happy to 
discuss a few use cases.
    
    Among MLlib algorithms, MLPR has the unique ability to generalize to unseen
feature values that have a nonlinear relationship with the output. On learned
relationships such as x^2 = y and x1*x2 = y, its performance is dramatically
better than the alternatives whenever the features leave the range the model
was trained on. These types of relationships show up in almost every important
modeling problem.
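To make the extrapolation failure concrete, here is a toy pure-Python sketch (not MLlib code; `piecewise_constant_fit` is a hypothetical stand-in for the piecewise-constant function shape a regression tree learns). Trained on y = x^2 over [0, 10], the tree-like model's prediction is frozen at the boundary leaf value outside that range, while the true function keeps growing:

```python
def piecewise_constant_fit(xs, ys):
    """Return a predictor that outputs the y of the nearest training x,
    mimicking the piecewise-constant fit of a fully-grown regression tree
    on a single feature."""
    pairs = sorted(zip(xs, ys))

    def predict(x):
        # Nearest-neighbor lookup over the training points.
        return min(pairs, key=lambda p: abs(p[0] - x))[1]

    return predict

# Train on x in [0, 10] for the target y = x^2.
train_x = list(range(11))
train_y = [x * x for x in train_x]
tree_like = piecewise_constant_fit(train_x, train_y)

# Inside the training range the fit is exact at the training points...
print(tree_like(7))    # 49
# ...but outside it the prediction stays at the boundary leaf's value,
# while the true function is 400 at x = 20.
print(tree_like(20))   # 100
```

A model that captures the functional form (as an MLP can) has no such ceiling.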
    
    Anyone looking to put a model into production who wants to perform well on
new data that isn't well represented in training needs an algorithm that can
generalize to that range. Below is a classic example - two variables in the
dataset interact to predict the outcome variable well. Within the range of the
training data, MLPR's performance is on par with Gradient Boosting, Random
Forests and Linear Regression. But outside the range of the training data, the
tree-based models are incapable of generalizing. Linear regression can only
generalize simple linear relationships, so it forces the user to manually
encode the complex relationships they want modeled. Because MLPR automatically
models the target as a nonlinear function of the features, with a structure
that generalizes well, it outperforms every other algorithm in MLlib in a
context like this.
    
     
    ![screen shot 2016-06-19 at 11 28 52 
pm](https://cloud.githubusercontent.com/assets/4738024/16188317/b38cb80a-3689-11e6-9619-35bc0a3c9020.png)
    
    
    MLPR also shows consistent, robust performance on standard datasets. Below
are examples of its performance relative to other models on Boston Housing,
Diabetes, and Iris (available here:
http://scikit-learn.org/stable/datasets/#toy-datasets). All models use their
default parameters (tanh activations and 50 neurons in a single hidden layer
for MLPR) and are evaluated using RMSE. Train/test split is a random 70/30
split with no validation set. All data is scaled (mean and std) in
preprocessing.
    
    Boston Housing
    NN - 3.87
    DT - 4.17
    RF - 3.23
    GBT - 4.34
    L2 LR  - 4.4
    
    
    Diabetes
    NN - 51.3
    DT - 65.2
    RF - 55.6
    GBT - 67.4
    L2 LR - 52.24
    
    Iris (Predicting Sepal Length)
    NN - 0.376
    DT - 0.451
    RF - 0.386
    GBT - 0.444
    LR - 0.295
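The evaluation protocol behind these numbers (standardize each feature to zero mean and unit std, take a random 70/30 train/test split, score with RMSE) can be sketched in plain Python. These are hypothetical helpers illustrating the protocol, not the code used for the runs above:

```python
import math
import random

def standardize(column):
    """Scale a feature column to zero mean and unit standard deviation."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / std for v in column]

def train_test_split(rows, test_frac=0.3, seed=0):
    """Random split; 70% train / 30% test by default, no validation set."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(round(len(rows) * (1 - test_frac)))
    return rows[:cut], rows[cut:]

def rmse(y_true, y_pred):
    """Root mean squared error, the metric used for all models above."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )

print(rmse([3.0, 4.0, 5.0], [3.0, 4.0, 6.0]))  # sqrt(1/3) ~= 0.577
```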
    
    Together these properties (generalization to unseen feature values plus
consistent performance) make it a valuable algorithm to have in a production
system that demands robust predictions. It learns a very different type of
structure from the decision-tree-based models already in MLlib, and so has
value as part of an ensemble whether or not it has the highest predictive
score on the validation data. Situations where it does have the best
predictive score are clear use cases.
    
    You bring up improvements to classification as well. One downside to the
current implementation of MLPC is that it forces users to use a sigmoid
activation function, which has the unfortunate property of saturating the
gradients. I provide support here for the more modern tanh, ReLU and linear
activations, which gives the user zero-centered options and activations that
do not kill gradients, and that can speed up convergence dramatically. These
benefits apply to both MLPR and MLPC, and should be included regardless of
the decision on the MLPR API.
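The saturation problem is easy to see numerically. The logistic sigmoid's derivative peaks at 0.25 and collapses toward zero for large |x|, so gradients flowing through saturated units vanish; ReLU keeps a unit gradient everywhere on its active half, and tanh is zero-centered (though it also saturates). A quick sketch:

```python
import math

def sigmoid_grad(x):
    # Derivative of the logistic sigmoid: s * (1 - s), peaks at 0.25 at x = 0.
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def tanh_grad(x):
    # Derivative of tanh: 1 - tanh(x)^2, peaks at 1.0 at x = 0.
    return 1.0 - math.tanh(x) ** 2

def relu_grad(x):
    # Derivative of ReLU: constant 1 on the active (positive) half.
    return 1.0 if x > 0 else 0.0

for x in (0.0, 5.0):
    print(x, sigmoid_grad(x), tanh_grad(x), relu_grad(x))
```

At x = 5 the sigmoid's gradient has already dropped below 0.01 while ReLU's is still 1, which is the "killed gradients" effect described above.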
    
    With a linear activation/layer and squared error loss included, the library
has all the functionality necessary to run MLPR. That functionality already
effectively exists in the library - all of the critical components, from the
topology to the optimizer to the activation functions, are already supported
and maintained in MLlib. All we require is an API to call the algorithm.
    
    That API could be as minimal as a single parameter to MLPC that replaces
the last layer with a linear layer and squared error loss.
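For clarity, the only mathematical pieces the regression head adds on top of the existing stack are an identity (linear) output activation and a squared-error loss; everything upstream is unchanged. A hedged standalone sketch (hypothetical functions, not the MLlib implementation):

```python
def linear_output(z):
    """Identity activation: the regression head passes the last layer's
    pre-activation through unchanged (vs. softmax/sigmoid in MLPC)."""
    return z

def squared_error(pred, target):
    """Squared error loss, 0.5 * (pred - target)^2."""
    return 0.5 * (pred - target) ** 2

def squared_error_grad(pred, target):
    """Gradient of the loss w.r.t. the prediction. With an identity output
    activation this is also the delta backpropagated into the last layer."""
    return pred - target

print(squared_error(3.0, 5.0))       # 2.0
print(squared_error_grad(3.0, 5.0))  # -2.0
```

The convenient cancellation (identity activation makes the output delta exactly `pred - target`) is why a single "replace the last layer" switch suffices.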
    
    The downside to that approach is inconsistency with the rest of MLlib and
the loss of automated scaling, which would either put users through much more
work or risk them getting extremely poor results from misuse. The naming may
also cause confusion, since the user would be doing regression with an
algorithm named for classification.
    
    The current proposed API is consistent with the rest of MLlib and with 
MLPC. It enables automated scaling and gives users a consistent experience, and 
so I recommend it. I can understand wanting the algorithm without having to 
support another API, and so we can entertain more flexible options if that 
looks attractive.
    
    I entirely understand w.r.t. 2.0 QA - I look forward to hearing the 
thoughts of @avulanov and @mengxr!

