Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3833#discussion_r23743197
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/Gradient.scala ---
    @@ -55,24 +57,79 @@ abstract class Gradient extends Serializable {
     
     /**
      * :: DeveloperApi ::
    - * Compute gradient and loss for a logistic loss function, as used in binary classification.
    - * See also the documentation for the precise formulation.
    + * Compute gradient and loss for a multinomial logistic loss function, as used
    + * in multi-class classification (it is also used in binary logistic regression).
    + *
    + * In `The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition`
    + * by Trevor Hastie, Robert Tibshirani, and Jerome Friedman, which can be downloaded from
    + * http://statweb.stanford.edu/~tibs/ElemStatLearn/, Eq. (4.17) on page 119 gives the formula of
    + * the multinomial logistic regression model. A simple calculation shows that
    + *
    + * P(y=0|x, w) = 1 / (1 + \sum_i^{K-1} \exp(x w_i))
    + * P(y=1|x, w) = \exp(x w_1) / (1 + \sum_i^{K-1} \exp(x w_i))
    + *   ...
    + * P(y=K-1|x, w) = \exp(x w_{K-1}) / (1 + \sum_i^{K-1} \exp(x w_i))
    + *
    + * for a multiclass classification problem with K classes.
    + *
    + * The model weights w = (w_1, w_2, ..., w_{K-1})^T becomes a matrix with dimension
    + * (K-1) * (N+1) if the intercepts are added, where N is the number of features. If the
    + * intercepts are not added, the dimension will be (K-1) * N.
    + *
    + * As a result, the loss of the objective function for a single instance of data can be written as
    + * l(w, x) = -log P(y|x, w) = -\alpha(y) log P(y=0|x, w) - (1-\alpha(y)) log P(y|x, w)
    + *         = log(1 + \sum_i^{K-1}\exp(x w_i)) - (1-\alpha(y)) x w_y
    + *         = log(1 + \sum_i^{K-1}\exp(margins_i)) - (1-\alpha(y)) margins_y
    + *
    + * where \alpha(i) = 1 if i == 0, and
    + *       \alpha(i) = 0 if i != 0,
    + *       margins_i = x w_i.
    + *
    + * For optimization, we have to calculate the first derivative of the loss function, and
    + * a simple calculation shows that
    + *
    + * \frac{\partial l(w, x)}{\partial w_{ij}}
    + *   = (\exp(x w_i) / (1 + \sum_k^{K-1} \exp(x w_k)) - (1-\alpha(y)) \delta_{y, i}) * x_j
    + *   = multiplier_i * x_j
    + *
    + * where \delta_{i, j} = 1 if i == j,
    + *       \delta_{i, j} = 0 if i != j, and
    + *       multiplier_i
    + *         = \exp(margins_i) / (1 + \sum_k^{K-1} \exp(margins_k)) - (1-\alpha(y)) \delta_{y, i}
    + *
    + * If any of the margins is larger than 709.78, the numerical computation of the multiplier
    + * and the loss function will suffer from arithmetic overflow. This issue occurs when there are
    + * outliers in the data which are far away from the hyperplane, and it will cause training to
    + * fail once infinity / infinity is introduced. Note that this is only a concern when
    + * max(margins) > 0.
    + *
    + * Fortunately, when max(margins) = maxMargin > 0, the loss function and the multiplier can be
    + * easily rewritten into the following equivalent numerically stable formula.
    + *
    + * l(w, x) = log(1 + \sum_i^{K-1}\exp(margins_i)) - (1-\alpha(y)) margins_y
    + *         = log(\exp(-maxMargin) + \sum_i^{K-1}\exp(margins_i - maxMargin)) + maxMargin
    + *           - (1-\alpha(y)) margins_y
    + *         = log(1 + sum) + maxMargin - (1-\alpha(y)) margins_y
    + *
    + * where sum = \exp(-maxMargin) + \sum_i^{K-1}\exp(margins_i - maxMargin) - 1.
    + *
    + * Note that each term inside \exp, (margins_i - maxMargin) and -maxMargin, is non-positive;
    + * as a result, overflow will not happen with this formula.
    + *
    + * For the multiplier, a similar trick can be applied as the following,
    + *
    + * multiplier_i = \exp(margins_i) / (1 + \sum_k^{K-1} \exp(margins_k)) - (1-\alpha(y)) \delta_{y, i}
    + *              = \exp(margins_i - maxMargin) / (1 + sum) - (1-\alpha(y)) \delta_{y, i}
    + *
    + * where each term inside \exp is also non-positive, so overflow is not a concern.
    --- End diff ---
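
    As a quick illustration of the formulas above, here is a minimal, self-contained Scala
    sketch of the pivoted (max-margin) computation. It is only an illustration, not the code
    from this patch: the names MultinomialLogisticLossSketch and lossAndMultipliers are made
    up for the example, margins(i) holds margins_{i+1} = x w_{i+1} (0-based array versus the
    1-based math above), and label is y in {0, 1, ..., K-1}.

        object MultinomialLogisticLossSketch {

          /** Loss l(w, x) and per-class multipliers for one instance, pivoted on maxMargin. */
          def lossAndMultipliers(margins: Array[Double], label: Int): (Double, Array[Double]) = {
            // Pivot on max(margins, 0) so that every exponent below is at most zero.
            val maxMargin = math.max(margins.max, 0.0)
            // sum = \exp(-maxMargin) + \sum_i \exp(margins_i - maxMargin) - 1
            // (this reduces to \sum_i \exp(margins_i) when maxMargin == 0)
            val sum = math.exp(-maxMargin) + margins.map(m => math.exp(m - maxMargin)).sum - 1.0
            // l(w, x) = log(1 + sum) + maxMargin - (1-\alpha(y)) margins_y
            val loss = math.log1p(sum) + maxMargin -
              (if (label != 0) margins(label - 1) else 0.0)
            // multipliers(i) is the multiplier for class i + 1, i.e. the weight row w_{i+1}:
            // \exp(margins_{i+1} - maxMargin) / (1 + sum) - (1-\alpha(y)) \delta_{y, i+1}
            val multipliers = margins.zipWithIndex.map { case (m, i) =>
              math.exp(m - maxMargin) / (1.0 + sum) - (if (label == i + 1) 1.0 else 0.0)
            }
            // The gradient contribution of feature j to weight row i is multipliers(i) * x(j).
            (loss, multipliers)
          }
        }

    With this form every exponent is at most zero, so math.exp never overflows and no
    infinity / infinity shows up even for margins in the hundreds.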
    
    The typos are missing indices or something minor like that. The approach 
and picture are correct.
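
    For a concrete sense of the overflow described in the doc, a tiny standalone check (in the
    Scala REPL, illustrative only and not from the patch) for the binary case K = 2:

        // Naive: P(y=1|x, w) = exp(margin) / (1 + exp(margin)); pivoted: divide through by exp(margin).
        val margin = 800.0                                         // outlier far from the hyperplane
        val naive = math.exp(margin) / (1.0 + math.exp(margin))    // Infinity / Infinity = NaN
        val pivoted = 1.0 / (math.exp(-margin) + 1.0)              // = 1.0, the exponent is negative
        println(s"naive = $naive, pivoted = $pivoted")

    Dividing the numerator and the denominator by \exp(maxMargin) is exactly the rewriting used
    in the doc above, which is why the K-class formulas pivot on maxMargin.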

