GitHub user mengxr commented on the pull request:
https://github.com/apache/spark/pull/2137#issuecomment-54253861
@BigCrunsh No, I meant logistic regression. As you mentioned, LR's output
can be off by a huge margin when the points are easily separable. There are
other cases that bias the output as well, for example an unbalanced label
distribution or a different regularization. My point is that we shouldn't
interpret LR's output as probabilities without calibration, and calling it
`predictProb` is misleading. A user might use the output directly to estimate,
e.g., expected revenue, which would be wrong.
It would be really helpful if you are interested in implementing isotonic
regression, which was used as the baseline in [1].
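For reference, here is a minimal sketch of the pool-adjacent-violators algorithm
(PAVA) that such an isotonic-regression calibrator could be built on. The object
and method names are hypothetical and not part of MLlib; it only illustrates the
core monotone fit over scores sorted in ascending order.
~~~
// Hypothetical sketch: pool-adjacent-violators (PAVA) for isotonic regression,
// which could calibrate raw classifier scores into probabilities.
object PavaSketch {
  /**
   * Fits a nondecreasing sequence to `y` (labels sorted by ascending score)
   * with per-point weights `w`, minimizing weighted squared error.
   */
  def fit(y: Array[Double], w: Array[Double]): Array[Double] = {
    require(y.length == w.length)
    val n = y.length
    val value = new Array[Double](n)   // block means
    val weight = new Array[Double](n)  // block weights
    val size = new Array[Int](n)       // block sizes
    var blocks = 0
    var i = 0
    while (i < n) {
      value(blocks) = y(i)
      weight(blocks) = w(i)
      size(blocks) = 1
      blocks += 1
      // merge adjacent blocks while monotonicity is violated
      while (blocks > 1 && value(blocks - 2) > value(blocks - 1)) {
        val tw = weight(blocks - 2) + weight(blocks - 1)
        value(blocks - 2) =
          (value(blocks - 2) * weight(blocks - 2) + value(blocks - 1) * weight(blocks - 1)) / tw
        weight(blocks - 2) = tw
        size(blocks - 2) += size(blocks - 1)
        blocks -= 1
      }
      i += 1
    }
    // expand block means back to per-point calibrated values
    val fitted = new Array[Double](n)
    var idx = 0
    var b = 0
    while (b < blocks) {
      var k = 0
      while (k < size(b)) {
        fitted(idx) = value(b)
        idx += 1
        k += 1
      }
      b += 1
    }
    fitted
  }
}
~~~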
How about adding a new method called `predictRaw` to `ClassificationModel`:
~~~
def predictRaw(point: Vector): Vector
~~~
It returns a vector of size `numClasses` that contains the
confidence/margin/probability score for each class. For LR, this is the output
of the logistic function. For SVM, this is `[margin, -margin]`. For a
classification tree, this is the class probability distribution at the leaf node.
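To make the proposal concrete, here is a rough sketch of what `predictRaw` could
look like for the two binary linear models. The trait and class names
(`HasRawPrediction`, `LogisticRegressionRaw`, `SVMRaw`) are hypothetical stand-ins
for the existing model classes, and putting the positive class at index 0 (to match
the `[margin, -margin]` layout above) is just one possible convention.
~~~
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Illustrative trait standing in for the proposed addition to ClassificationModel.
trait HasRawPrediction {
  def predictRaw(point: Vector): Vector
}

class LogisticRegressionRaw(weights: Vector, intercept: Double) extends HasRawPrediction {
  // Sigmoid of the margin; index 0 holds the positive-class score here,
  // consistent with the SVM layout below (the ordering is a design choice).
  override def predictRaw(point: Vector): Vector = {
    val margin = dot(weights, point) + intercept
    val p = 1.0 / (1.0 + math.exp(-margin))
    Vectors.dense(p, 1.0 - p)
  }
  private def dot(a: Vector, b: Vector): Double =
    a.toArray.zip(b.toArray).map { case (x, y) => x * y }.sum
}

class SVMRaw(weights: Vector, intercept: Double) extends HasRawPrediction {
  // Raw margins per class, following the [margin, -margin] layout suggested above.
  override def predictRaw(point: Vector): Vector = {
    val margin =
      weights.toArray.zip(point.toArray).map { case (x, y) => x * y }.sum + intercept
    Vectors.dense(margin, -margin)
  }
}
~~~
A caller could then recover the predicted label with an argmax over the returned
vector, and feed the raw scores into a calibrator (such as isotonic regression)
when actual probabilities are needed.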