Github user JeremyNixon commented on the issue:
https://github.com/apache/spark/pull/13617
@jkbradley, of course you're welcome for the PR! I'd be happy to
discuss a few use cases.
Among MLlib algorithms, MLPR has the unique ability to generalize to unseen
feature values that have a nonlinear relationship with the output. Experiments
on relationships such as x^2 = y and x1*x2 = y show that it performs
dramatically better on such problems whenever the features leave the range
that the model was trained on. These types of relationships show up in almost
every important modeling problem.
Anyone looking to put a model into production who wants it to perform well on
new data that isn't well represented in training needs an algorithm that can
generalize to that range. Below is a classic example - two variables in the
dataset interact to predict the outcome variable well. Within the range of the
training data, MLPR's performance is on par with Gradient Boosting, Random
Forests, and Linear Regression. But outside the range of the training data,
the tree-based models are incapable of generalizing. Linear regression can
only generalize simple linear relationships, so it forces the user to manually
encode the complex relationships they want modeled. Because MLPR can
automatically model the target as a non-linear function of features with a
structure that generalizes well, it outperforms every other algorithm in MLlib
in a context like this.
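To make the extrapolation point concrete, here is a minimal stdlib-Python sketch (not the Scala in this PR) of why piecewise-constant tree models cannot generalize y = x^2 beyond the training range: a regression tree predicts leaf means, so its output is frozen once x leaves the data it split on.

```python
# Sketch: a depth-1 regression tree ("stump") fit to y = x^2 on [-10, 10].
# Outside that range, its prediction is stuck at a leaf mean, while the
# true target keeps growing quadratically.

def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: choose the split minimizing squared error."""
    best = None
    for i in range(1, len(xs)):
        t = (xs[i - 1] + xs[i]) / 2.0
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

xs = list(range(-10, 11))      # training range: [-10, 10]
ys = [x ** 2 for x in xs]      # target: y = x^2
stump = fit_stump(xs, ys)

# The stump's prediction anywhere is bounded by its leaf means (at most
# max(ys) = 100), so at x = 100 it cannot come close to the true 10000.
print(stump(100))
print(100 ** 2)
```

Deeper trees and forests refine the fit inside the training range but share the same ceiling outside it, which is the failure mode described above.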

MLPR also shows consistent, robust performance on standard datasets. Below
are examples of its performance relative to other models on Boston, Diabetes,
and Iris (available here:
http://scikit-learn.org/stable/datasets/#toy-datasets). All models use
their default parameters (tanh activations and 50 neurons in a single hidden
layer for MLPR) and are evaluated using RMSE. The train/test split is a random
70/30 split with no validation set. All data is scaled (mean and std) in
preprocessing.
Boston
NN - 3.87
DT - 4.17
RF - 3.23
GBT - 4.34
L2 LR - 4.4

Diabetes
NN - 51.3
DT - 65.2
RF - 55.6
GBT - 67.4
L2 LR - 52.24

Iris (Predicting Sepal Length)
NN - 0.376
DT - 0.451
RF - 0.386
GBT - 0.444
LR - 0.295
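The evaluation protocol above (mean/std scaling, random 70/30 split, RMSE) can be sketched in a few lines of stdlib Python; the helper names here are illustrative, not from the PR:

```python
import random
from math import sqrt

def standardize(column):
    """Scale a feature column to zero mean and unit standard deviation."""
    mean = sum(column) / len(column)
    std = sqrt(sum((v - mean) ** 2 for v in column) / len(column)) or 1.0
    return [(v - mean) / std for v in column]

def rmse(y_true, y_pred):
    """Root mean squared error, the metric used for the numbers above."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def train_test_split(rows, train_frac=0.7, seed=0):
    """Random 70/30 split with no validation set."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]
```

Standardizing in preprocessing matters here because neural-network training is sensitive to feature scale in a way that tree ensembles are not; without it, the MLPR numbers would not be comparable.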
Together these properties (generalization to unseen feature values +
consistent performance) make it a valuable algorithm to have in a production
system that demands robust predictions. It learns a very different type of
structure from the decision-tree-based models already in MLlib, and so has
value as part of an ensemble whether or not it has the highest predictive
score on the validation data. Situations where it does have the best predictive
score are clear use cases.
You bring up improvements to classification as well. One downside of the
current implementation of MLPC is that it forces users to use a sigmoid
activation function, which has the unfortunate property of saturating
gradients. I add support here for the more modern tanh, ReLU, and linear
activations, giving users zero-centered options that do not kill gradients and
can speed up convergence dramatically. These benefits apply to both MLPR and
MLPC, and should be included regardless of the decision on the MLPR API.
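The saturation argument is easy to see numerically. A short stdlib-Python sketch (illustrative, not the PR's Scala code): sigmoid's derivative peaks at 0.25 and vanishes for large |x|, so each sigmoid layer shrinks backpropagated gradients, while tanh's derivative peaks at 1.0 and ReLU's is exactly 1.0 on the active side.

```python
from math import exp, tanh

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # peaks at 0.25 when x = 0

def tanh_grad(x):
    return 1.0 - tanh(x) ** 2     # peaks at 1.0 when x = 0; zero-centered output

def relu_grad(x):
    return 1.0 if x > 0 else 0.0  # no saturation on the active side

print(sigmoid_grad(0.0))  # 0.25: even at its peak, shrinks gradients 4x per layer
print(tanh_grad(0.0))     # 1.0
print(sigmoid_grad(5.0))  # already saturated: well under 0.01
print(relu_grad(5.0))     # 1.0
```

Stacking a few saturated sigmoid layers multiplies these small factors together, which is the "killed gradients" effect mentioned above.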
With a linear activation/layer and squared-error loss included, the library
has all the functionality necessary to run MLPR. That functionality already
effectively exists in the library - all of the critical components, from the
topology to the optimizer to the activation functions, are already supported
and maintained in MLlib. All we require is an API to call the algorithm.
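To show how little changes at the output layer, here is a hedged stdlib-Python sketch (not the Scala in the PR) of the regression head: an identity activation plus squared-error loss, whose output-layer delta reduces to (prediction - target), the same shape of delta that softmax plus cross-entropy yields for classification.

```python
# Sketch of the regression output layer: identity activation + squared error.

def linear_output(z):
    """Identity activation for the last layer: regression outputs are unbounded."""
    return z

def squared_error(pred, target):
    """L = 0.5 * ||pred - target||^2."""
    return 0.5 * sum((p - t) ** 2 for p, t in zip(pred, target))

def output_delta(pred, target):
    # With squared error and an identity output activation, dL/dz at the
    # last layer simplifies to (pred - target), so the rest of
    # backpropagation is untouched.
    return [p - t for p, t in zip(pred, target)]

pred, target = [2.0, -1.0], [1.5, 0.0]
print(squared_error(pred, target))  # 0.625
print(output_delta(pred, target))   # [0.5, -1.0]
```

This is why swapping the output layer is sufficient: everything upstream of the delta is shared with MLPC.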
That API could be as minimal as a single parameter on MLPC that replaces
the last layer with a linear layer and squared-error loss.
The downside of that approach is inconsistency with the rest of MLlib and
the loss of automated scaling, which would put users through a lot more work
or risk extremely poor results from misuse. The naming could also cause
confusion, since the user would be doing regression with an algorithm named
for classification.
The current proposed API is consistent with the rest of MLlib and with
MLPC. It enables automated scaling and gives users a consistent experience, and
so I recommend it. I can understand wanting the algorithm without having to
support another API, and so we can entertain more flexible options if that
looks attractive.
I entirely understand w.r.t. 2.0 QA - I look forward to hearing the
thoughts of @avulanov and @mengxr!