Github user sethah commented on a diff in the pull request:
https://github.com/apache/spark/pull/13796#discussion_r75230364
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -933,32 +946,312 @@ class BinaryLogisticRegressionSummary private[classification] (
 }
 /**
- * LogisticAggregator computes the gradient and loss for binary logistic loss function, as used
- * in binary classification for instances in sparse or dense vector in an online fashion.
- *
- * Note that multinomial logistic loss is not supported yet!
+ * LogisticAggregator computes the gradient and loss for the binary or multinomial logistic
+ * (softmax) loss function, as used in classification for instances in sparse or dense vectors,
+ * in an online fashion.
  *
- * Two LogisticAggregator can be merged together to have a summary of loss and gradient of
+ * Two LogisticAggregators can be merged together to have a summary of the loss and gradient of
  * the corresponding joint dataset.
*
+ * To improve the convergence rate during the optimization process, and also to prevent
+ * features with very large variances from exerting an overly large influence during model
+ * training, packages like R's GLMNET perform scaling to unit variance and remove the mean in
+ * order to reduce the condition number. The model is then trained in this scaled space, but
+ * returns the coefficients in the original scale. See page 9 in
+ * http://cran.r-project.org/web/packages/glmnet/glmnet.pdf
+ *
+ * However, we don't want to apply the [[org.apache.spark.ml.feature.StandardScaler]] on the
+ * training dataset and then cache the standardized dataset, since that would create a lot of
+ * overhead. Instead, we perform the scaling implicitly when we compute the objective function
+ * (though we do not subtract the mean).
+ *
+ * Note that there is a difference between multinomial (softmax) and binary loss. The binary case
+ * uses one outcome class as a "pivot" and regresses the other class against the pivot. In the
+ * multinomial case, the softmax loss function is used to model each class probability
+ * independently. Using softmax loss produces `K` sets of coefficients, while using a pivot class
+ * produces `K - 1` sets of coefficients (a single coefficient vector in the binary case). When
+ * regularization is applied, multinomial (softmax) loss will therefore produce a result
+ * different from binary loss, since the positive and negative classes do not share coefficients
+ * in the softmax parameterization, while binary regression shares a single coefficient vector
+ * between them.
+ *
+ * The following is a mathematical derivation for the multinomial (softmax) loss.
+ *
+ * The probability of the multinomial outcome $y$ taking on any of the K possible outcomes is:
+ *
+ * <p><blockquote>
+ *    $$
+ *    P(y_i=0|\vec{x}_i, \beta) = \frac{e^{\vec{x}_i^T \vec{\beta}_0}}{\sum_{k=0}^{K-1}
+ *       e^{\vec{x}_i^T \vec{\beta}_k}} \\
+ *    P(y_i=1|\vec{x}_i, \beta) = \frac{e^{\vec{x}_i^T \vec{\beta}_1}}{\sum_{k=0}^{K-1}
+ *       e^{\vec{x}_i^T \vec{\beta}_k}} \\
+ *    P(y_i=K-1|\vec{x}_i, \beta) = \frac{e^{\vec{x}_i^T \vec{\beta}_{K-1}}}{\sum_{k=0}^{K-1}
+ *       e^{\vec{x}_i^T \vec{\beta}_k}}
+ *    $$
+ * </blockquote></p>
+ *
+ * The model coefficients $\beta = (\beta_0, \beta_1, \beta_2, ..., \beta_{K-1})$ become a matrix
--- End diff --
done.
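As a side note for readers of this thread, the softmax probabilities in the quoted equation, and the reason the binary case can get away with a single shared coefficient vector, can be sketched in a few lines of plain Scala. This is an illustrative sketch only, not Spark's LogisticAggregator; `SoftmaxSketch` and its method names are hypothetical.

```scala
// Sketch of the multinomial (softmax) probabilities from the equation above:
//   P(y_i = k | x_i, beta) = exp(x_i^T beta_k) / sum_j exp(x_i^T beta_j)
object SoftmaxSketch {

  // Given the K margins x^T beta_k, return the K class probabilities.
  def softmax(margins: Array[Double]): Array[Double] = {
    // Subtract the max margin for numerical stability. This shift does not
    // change the probabilities, which is the same fact that lets the binary
    // case fix one class as a "pivot".
    val maxMargin = margins.max
    val exps = margins.map(m => math.exp(m - maxMargin))
    val norm = exps.sum
    exps.map(_ / norm)
  }

  // Compute the margins x^T beta_k for each of the K coefficient vectors,
  // then normalize them with softmax.
  def probabilities(x: Array[Double], beta: Array[Array[Double]]): Array[Double] = {
    val margins = beta.map(b => b.zip(x).map { case (bi, xi) => bi * xi }.sum)
    softmax(margins)
  }
}
```

Shifting every coefficient vector $\vec{\beta}_k$ by the same constant vector shifts all K margins by the same amount and leaves the probabilities unchanged; this over-parameterization is why a pivot class reduces K sets of coefficients to K - 1, and why regularized softmax and regularized binary loss give different answers.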
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]