Seth Hendrickson created SPARK-17476:
----------------------------------------

             Summary: Proper handling for unseen labels in logistic regression training
                 Key: SPARK-17476
                 URL: https://issues.apache.org/jira/browse/SPARK-17476
             Project: Spark
          Issue Type: New Feature
          Components: ML
            Reporter: Seth Hendrickson


Now that logistic regression supports multiclass classification, it is 
possible to train on data that nominally has {{K}} classes, where one or more 
of those classes do not appear in the training set. For example,

{code}
(0.0, x1)
(2.0, x2)
...
{code}

Currently, logistic regression assumes that the outcome in the above dataset 
has three levels: {{0, 1, 2}}. Since label 1 never appears in training, it 
should never be predicted. In theory, its coefficients should be zero and its 
intercept should be negative infinity. This can cause problems because we 
center the intercepts after training, and centering is not well defined when 
one of the intercepts is infinite.
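
To make the concern concrete, here is a minimal sketch in plain Python (not 
Spark code). It assumes the number of classes is taken as the largest label 
plus one, and that centering means subtracting the mean intercept; both are 
assumptions for illustration, and the helper names are hypothetical.

```python
import math

def infer_num_classes(labels):
    # Assumption: classes are indexed 0..K-1, so K = max label + 1.
    return int(max(labels)) + 1

def unseen_labels(labels):
    # Class indices in 0..K-1 that never occur in the training labels.
    k = infer_num_classes(labels)
    seen = {int(y) for y in labels}
    return [c for c in range(k) if c not in seen]

def center(intercepts):
    # Assumption: centering subtracts the mean of the intercepts.
    mean = sum(intercepts) / len(intercepts)
    return [b - mean for b in intercepts]

labels = [0.0, 2.0]
num_classes = infer_num_classes(labels)  # three levels: 0, 1, 2
missing = unseen_labels(labels)          # label 1 is never observed

# Centering finite intercepts behaves as expected.
finite = center([0.5, -1.0, 2.0])

# If the unseen class had an intercept of -infinity, the mean is -infinity,
# so the centered values become +infinity or NaN.
degenerate = center([0.5, -math.inf, 2.0])
```

In this sketch the degenerate case leaves no usable intercepts at all, which 
is the kind of failure the centering step could run into.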

We should discuss whether or not the intercepts actually tend to -infinity in 
practice, and whether or not we should even include them in training. 
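
As a starting point for that discussion, the following toy experiment (plain 
Python, not Spark's optimizer) fits an intercept-only multinomial model with 
unregularized gradient descent. The function names, step size, and iteration 
counts are hypothetical choices for illustration. For a class with no 
examples, the gradient of the negative log-likelihood with respect to its 
intercept is just the mean predicted probability of that class, which is 
always positive, so the intercept keeps decreasing.

```python
import math

def softmax(margins):
    # Numerically stable softmax over a list of margins.
    m = max(margins)
    exps = [math.exp(z - m) for z in margins]
    total = sum(exps)
    return [e / total for e in exps]

def fit_intercepts(labels, num_classes, steps, lr=0.5):
    # Intercept-only multinomial logistic regression, no regularization.
    # The gradient of the mean negative log-likelihood w.r.t. intercept k
    # is p_k - (fraction of examples with label k).
    n = len(labels)
    freq = [sum(1 for y in labels if int(y) == k) / n
            for k in range(num_classes)]
    b = [0.0] * num_classes
    for _ in range(steps):
        p = softmax(b)
        b = [b_k - lr * (p_k - f_k) for b_k, p_k, f_k in zip(b, p, freq)]
    return b

labels = [0.0, 2.0, 0.0, 2.0]   # label 1 never appears
b_short = fit_intercepts(labels, num_classes=3, steps=100)
b_long = fit_intercepts(labels, num_classes=3, steps=1000)

# Predicted probability of the unseen class after each run.
p_short = softmax(b_short)[1]
p_long = softmax(b_long)[1]
```

In this toy setting the unseen class's intercept keeps drifting downward and 
its predicted probability keeps shrinking, but any finite number of 
iterations leaves it finite. Whether the same holds with Spark's optimizers, 
convergence tolerances, and regularization is exactly the open question.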



