[GitHub] spark pull request: [SPARK-3251][MLLIB]: Clarify learning interfac...

BigCrunsh Thu, 28 Aug 2014 03:31:17 -0700

Github user BigCrunsh commented on the pull request:

    https://github.com/apache/spark/pull/2137#issuecomment-53702702
  
    @mengxr, could it be that you are confusing logistic regression with Naive
Bayes? Logistic regression typically predicts well-calibrated probabilities,
see e.g. [1]; it might only be problematic if the data can be separated
perfectly, because the unregularized weights then grow without bound and the
predicted probabilities saturate at zero or one. The learning algorithm returns
("is responsible for") a model that maximizes the likelihood of the data under
the model assumption; in classification, the returned "probability" measures
how likely it is that a certain label is generated by the learned model for a
given example. Adding an isotonic regression is a good idea anyway.
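
    To make the isotonic regression idea concrete, here is a minimal sketch of
calibrating scores with the pool-adjacent-violators algorithm. The function
names ``fit_isotonic`` and ``calibrated_probability`` are hypothetical, the
sketch assumes distinct scores and binary 0/1 labels, and it is not meant to
describe how MLlib or sklearn implement calibration.

```python
import bisect


def fit_isotonic(scores, labels):
    """Fit a non-decreasing step function from scores to label frequencies.

    Returns a list of (upper_score, probability) steps sorted by score.
    Assumption: scores are distinct and labels are 0 or 1.
    """
    # Each block holds [sum of labels, number of examples, largest score].
    blocks = []
    for score, label in sorted(zip(scores, labels)):
        blocks.append([float(label), 1, score])
        # Pool adjacent blocks while their means violate monotonicity.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]


def calibrated_probability(steps, score):
    """Look up the calibrated value for a raw score (clamped at both ends)."""
    uppers = [upper for upper, _ in steps]
    index = min(bisect.bisect_left(uppers, score), len(steps) - 1)
    return steps[index][1]


steps = fit_isotonic([0.1, 0.2, 0.3, 0.4], [0, 1, 0, 1])
print(steps)
print(calibrated_probability(steps, 0.25))
```

    The fitted steps map any real-valued score to a value between zero and
one, which is exactly the score/probability split discussed in this comment.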
    
    I think we should definitely distinguish between the output of the linear
model (score) and the calibrated value (probability); which one is needed
depends on the task. Furthermore, having a function that changes the type of
output depending on the model is misleading. E.g., one should expect that a
score function always returns an arbitrary real value and that the calibrated
version returns a value between zero and one. Sklearn [2], for example, makes
this distinction too: ``decision_function`` for scores, ``predict`` for class
labels, ``predict_proba`` for probability estimates. However, it is not obvious
what ``predict`` returns (@mengxr: what do you mean by "raw predictions"?). My
suggestion would be:
    - ``classify`` or ``predictClass`` for the class;
    - ``score`` or ``decisionValue`` or ``predictScore`` for the outcome of the 
linear model;
    - ``probabilityEstimate`` or ``predictProbability`` for an estimate of the 
class probability.
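
    The three-way split suggested above could look roughly like the following
sketch. The class ``LinearBinaryClassifier`` and its constructor arguments are
hypothetical, the method names are snake_case versions of the suggested names,
and the logistic link is an assumption that only fits logistic regression.

```python
import math


class LinearBinaryClassifier:
    """Hypothetical binary linear model with one method per output type."""

    def __init__(self, weights, intercept=0.0, threshold=0.5):
        self.weights = list(weights)
        self.intercept = intercept
        self.threshold = threshold

    def predict_score(self, features):
        """Outcome of the linear model: an arbitrary real value."""
        return sum(w * x for w, x in zip(self.weights, features)) + self.intercept

    def predict_probability(self, features):
        """Estimate of the class probability: a value between zero and one."""
        score = self.predict_score(features)
        # Numerically stable logistic function (avoids overflow in exp).
        if score >= 0:
            return 1.0 / (1.0 + math.exp(-score))
        exp_score = math.exp(score)
        return exp_score / (1.0 + exp_score)

    def predict_class(self, features):
        """Class label, obtained by thresholding the probability."""
        return 1 if self.predict_probability(features) >= self.threshold else 0


model = LinearBinaryClassifier([2.0, -1.0], intercept=0.5)
example = [1.0, 0.5]
print(model.predict_score(example))
print(model.predict_probability(example))
print(model.predict_class(example))
```

    Each method has a fixed return type regardless of the model, so callers
never have to guess whether they got a score, a probability, or a label.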
    
    Perhaps ``predict`` could return the class for classification and the
regression value for regression tasks (or just be maintained as a deprecated
version).
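
    As a sketch of the "deprecated version" option, ``predict`` could be kept
as a thin alias that warns and delegates to the explicit method. The class
``ThresholdClassifier`` and the name ``predict_class`` are hypothetical and
only illustrate the pattern.

```python
import warnings


class ThresholdClassifier:
    """Hypothetical classifier that labels a precomputed score."""

    def __init__(self, threshold=0.0):
        self.threshold = threshold

    def predict_class(self, score):
        """Explicitly named method: always returns a class label."""
        return 1 if score >= self.threshold else 0

    def predict(self, score):
        """Deprecated alias kept for backward compatibility."""
        warnings.warn(
            "predict is deprecated; use predict_class",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.predict_class(score)


classifier = ThresholdClassifier()
print(classifier.predict_class(1.5))
```

    Existing callers of ``predict`` keep working, while the warning points
them to the unambiguous name.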
    
    [1] Niculescu-Mizil, Alexandru, and Rich Caruana. "Predicting good
probabilities with supervised learning." Proceedings of the 22nd International
Conference on Machine Learning. ACM, 2005.
    [2] http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html



