Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2137#issuecomment-54253861
  
    @BigCrunsh No, I meant logistic regression. As you mentioned, LR's output
    will be off by a huge margin when the points are easily separable. There are
    other cases that bias the output as well, for example an unbalanced label
    distribution or different regularization. My point is that we shouldn't
    interpret LR's output as probabilities without calibration, so calling it
    `predictProb` is misleading. A user may use the output directly to estimate,
    e.g., expected revenue, which would be wrong.
    
    It would be really helpful if you are interested in implementing isotonic
    regression, which was used as the baseline in [1].
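    
    In case it helps, here is a rough sketch of the pool-adjacent-violators step
    that isotonic regression boils down to. This is my own illustration, not
    existing MLlib code, and it assumes the binary labels are already sorted by
    the raw classifier score:
    
    ~~~
    // Rough sketch of pool-adjacent-violators (PAV) for isotonic calibration.
    // `labels` are binary labels sorted by the raw classifier score.
    def isotonicFit(labels: Array[Double]): Array[Double] = {
      // Each block holds (sum of labels, count). Adjacent blocks are merged
      // whenever their means violate the non-decreasing constraint.
      val blocks = scala.collection.mutable.ArrayBuffer.empty[(Double, Int)]
      for (y <- labels) {
        blocks += ((y, 1))
        while (blocks.length > 1 && {
          val (s1, n1) = blocks(blocks.length - 2)
          val (s2, n2) = blocks(blocks.length - 1)
          s1 / n1 > s2 / n2
        }) {
          val (s2, n2) = blocks.remove(blocks.length - 1)
          val (s1, n1) = blocks.remove(blocks.length - 1)
          blocks += ((s1 + s2, n1 + n2))
        }
      }
      // Expand each block's mean back to one calibrated probability per point.
      blocks.flatMap { case (s, n) => Array.fill(n)(s / n) }.toArray
    }
    ~~~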
    
    How about adding a new method called `predictRaw` to `ClassificationModel`:
    
    ~~~
    def predictRaw(point: Vector): Vector
    ~~~
    
    It returns a vector of size `numClasses`, which contains the
    confidence/margin/probability score for each class. For LR, this is the
    output of the logistic model. For SVM, this is `[margin, -margin]`. For a
    classification tree, this is the probability of each class at the leaf node.
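    
    To make the contract concrete, here is a rough sketch of how a binary linear
    SVM could fill in `predictRaw`. The trait/class names and the `dot` helper
    are placeholders for illustration, not the actual API:
    
    ~~~
    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    
    // Placeholder names; only meant to illustrate the proposed contract.
    trait ClassificationModelSketch {
      def predictRaw(point: Vector): Vector
    }
    
    class SVMModelSketch(weights: Vector, intercept: Double)
        extends ClassificationModelSketch {
    
      private def dot(a: Vector, b: Vector): Double =
        a.toArray.zip(b.toArray).map { case (x, y) => x * y }.sum
    
      // For a binary SVM the raw scores would be [margin, -margin]; the
      // predicted class is the index of the largest entry.
      override def predictRaw(point: Vector): Vector = {
        val margin = dot(weights, point) + intercept
        Vectors.dense(margin, -margin)
      }
    }
    ~~~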

