Github user BigCrunsh commented on the pull request:
https://github.com/apache/spark/pull/2137#issuecomment-53702702
@mengxr, could you be confusing logistic regression with Naive Bayes?
Logistic regression typically predicts well-calibrated probabilities, see e.g.
[1]; it should only be problematic if the data can be separated perfectly
(without regularization, the weights then diverge and the probabilities
saturate at 0 or 1). The learning algorithm returns ("is responsible for") a
model that maximizes the likelihood of the data under the model assumption; in
classification, the returned "probability" measures how likely it is that a
certain label is generated by the learned model for a given example. Adding an
isotonic regression is a good idea anyway.
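To make the isotonic-regression idea concrete, here is a minimal sketch of
calibration by pool-adjacent-violators in plain Python. The function name
``fit_isotonic`` and the toy data are assumptions for illustration only; this
is not the code in this pull request, and it assumes distinct scores.

```python
from typing import List, Sequence, Tuple


def fit_isotonic(scores: Sequence[float],
                 labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (score, calibrated value) pairs, sorted by score.

    Hypothetical helper: fits a non-decreasing step function to the 0/1
    labels using the pool-adjacent-violators algorithm.
    """
    pairs = sorted(zip(scores, labels))
    # Each block is [sum of labels, number of points, scores in the block].
    blocks: list = []
    for s, y in pairs:
        blocks.append([float(y), 1, [s]])
        # Merge backwards while the previous block's mean exceeds the
        # current block's mean (a monotonicity violation).
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, members = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2].extend(members)
    # Every point in a block gets the block mean as its calibrated value.
    return [(s, total / count)
            for total, count, members in blocks
            for s in members]


# Toy data: the labels at scores 2.0 and 3.0 are out of order and get pooled.
calibrated = fit_isotonic([1.0, 2.0, 3.0, 4.0], [0, 1, 0, 1])
```

The fitted values are non-decreasing in the score, so they can serve as
probability estimates on top of any monotone scoring function.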
I think we should definitely distinguish between the output of the linear
model (score) and the calibrated value (probability); which of the two is
needed depends on the task. Furthermore, having a function whose output type
changes depending on the model is misleading. E.g., one should expect that a
score function always returns an arbitrary real value and that the calibrated
version returns a value between zero and one. Sklearn [2], for example, makes
this distinction too: ``decision_function`` for scores, ``predict`` for class
labels, ``predict_proba`` for probability estimates. However, it is not
obvious what ``predict`` returns (@mengxr: what do you mean by "raw
predictions"?). My suggestion would be:
- ``classify`` or ``predictClass`` for the class;
- ``score`` or ``decisionValue`` or ``predictScore`` for the outcome of the
linear model;
- ``probabilityEstimate`` or ``predictProbability`` for an estimate of the
class probability.
Perhaps ``predict`` could return the class for classification and the
regression value for regression tasks (or just be kept as a deprecated
method).
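The separation suggested above could look roughly like the following sketch
for a binary logistic model. The class ``LogisticModel`` and the snake_case
method names are hypothetical stand-ins for the proposed names, not an
existing API.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LogisticModel:
    """Hypothetical binary linear model with one method per output type."""
    weights: Sequence[float]
    intercept: float = 0.0

    def predict_score(self, x: Sequence[float]) -> float:
        # Raw output of the linear model: an arbitrary real value.
        return sum(w * v for w, v in zip(self.weights, x)) + self.intercept

    def predict_probability(self, x: Sequence[float]) -> float:
        # Logistic link applied to the score: a value between zero and one.
        s = self.predict_score(x)
        if s >= 0:
            return 1.0 / (1.0 + math.exp(-s))
        # For negative scores, use the equivalent form that cannot overflow.
        e = math.exp(s)
        return e / (1.0 + e)

    def predict_class(self, x: Sequence[float], threshold: float = 0.5) -> int:
        # Class label obtained by thresholding the probability estimate.
        return 1 if self.predict_probability(x) >= threshold else 0


model = LogisticModel(weights=[2.0, -1.0], intercept=0.5)
example = [1.0, 0.5]
score = model.predict_score(example)
probability = model.predict_probability(example)
label = model.predict_class(example)
```

Each method has a fixed return type regardless of the model, so callers pick
the output they need instead of guessing what a single ``predict`` returns.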
[1] Niculescu-Mizil, Alexandru, and Rich Caruana. "Predicting good
probabilities with supervised learning." Proceedings of the 22nd International
Conference on Machine Learning. ACM, 2005.
[2] http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html