Seth Hendrickson created SPARK-17476:
----------------------------------------
Summary: Proper handling for unseen labels in logistic regression
training.
Key: SPARK-17476
URL: https://issues.apache.org/jira/browse/SPARK-17476
Project: Spark
Issue Type: New Feature
Components: ML
Reporter: Seth Hendrickson
Now that logistic regression supports multiclass, it is possible to train on
data that has {{K}} classes, but where one or more of the classes do not appear
in training. For example,
{code}
(0.0, x1)
(2.0, x2)
...
{code}
Currently, logistic regression assumes that the outcome classes in the above
dataset have three levels: {{0, 1, 2}}. Since label 1 never appears, it should
never be predicted. In theory, the coefficients for that class should be zero
and its intercept should be negative infinity. This can cause problems because
we center the intercepts after training, and an infinite intercept cannot be
centered meaningfully.
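To illustrate the centering concern, here is a minimal sketch in plain Python. It is not Spark's implementation: it assumes that "centering" means subtracting the mean intercept from each intercept, and {{center_intercepts}} and the intercept values are hypothetical names and numbers chosen for illustration.

```python
import math


def center_intercepts(intercepts):
    """Subtract the mean intercept from every intercept (assumed centering rule)."""
    mean = sum(intercepts) / len(intercepts)
    return [b - mean for b in intercepts]


# Finite intercepts center cleanly.
finite = center_intercepts([0.5, -1.0, 2.0])

# Hypothetical fitted intercepts where class 1 was never observed,
# so its intercept has gone all the way to negative infinity.
degenerate = center_intercepts([0.5, float("-inf"), 2.0])

print(finite)
print(degenerate)
```

Under this assumption the mean itself becomes negative infinity, so the intercepts of the classes that were observed are pushed to positive infinity and the unseen class's intercept becomes NaN. One infinite intercept is enough to make every centered value unusable.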
We should discuss whether or not the intercepts actually tend to -infinity in
practice, and whether or not we should even include them in training.
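As a starting point for that discussion, the following is a small sketch of what happens to the intercept of an unseen class under plain gradient descent. It rests on several assumptions that are not from Spark: an intercept-only multinomial model, no regularization, a fixed step size, and hypothetical helper names ({{softmax}}, {{fit_intercepts}}). Spark's actual optimizer and model may behave differently.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def fit_intercepts(labels, num_classes, steps, lr=1.0):
    """Gradient descent on the intercept-only multinomial log loss."""
    n = len(labels)
    # Empirical class frequencies; an unseen class has frequency 0.
    freq = [labels.count(k) / n for k in range(num_classes)]
    b = [0.0] * num_classes
    for _ in range(steps):
        p = softmax(b)
        # Gradient of the mean log loss with respect to b_k is p_k - freq_k.
        b = [bk - lr * (pk - fk) for bk, pk, fk in zip(b, p, freq)]
    return b


# Three classes are assumed, but label 1 never appears.
labels = [0, 2, 0, 2]
b_short = fit_intercepts(labels, 3, steps=100)
b_long = fit_intercepts(labels, 3, steps=10000)

print(b_short)
print(b_long)
```

In this sketch the gradient for the unseen class is its predicted probability, which is always positive, so its intercept decreases at every step and never converges. The decrease slows as the probability shrinks, so after any finite number of iterations the intercept is still finite. The gradients also sum to zero, so intercepts that start at zero stay centered throughout.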