GitHub user dbtsai commented on a diff in the pull request:
https://github.com/apache/spark/pull/3833#discussion_r23823903
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/classification/LogisticRegressionSuite.scala ---
@@ -55,6 +56,97 @@ object LogisticRegressionSuite {
val testData = (0 until nPoints).map(i => LabeledPoint(y(i),
Vectors.dense(Array(x1(i)))))
testData
}
+
+ /**
+  * Generates `k` classes multinomial synthetic logistic input in `n` dimensional space given the
+  * model weights and mean/variance of the features. The synthetic data will be drawn from
+  * the probability distribution constructed by weights using the following formula:
+  *
+  *   P(y = 0 | x) = 1 / norm
+  *   P(y = 1 | x) = exp(x * w_1) / norm
+  *   P(y = 2 | x) = exp(x * w_2) / norm
+  *   ...
+  *   P(y = k-1 | x) = exp(x * w_{k-1}) / norm
+  *
+  * where norm = 1 + exp(x * w_1) + exp(x * w_2) + ... + exp(x * w_{k-1}).
+  *
+  * @param weights matrix flattened into a vector; as a result, the dimension of the weights
+  *                vector will be (k - 1) * (n + 1) if `addIntercept == true`, and
+  *                (k - 1) * n if `addIntercept != true`.
+  * @param xMean the mean of the generated features. Often, if the features are not properly
+  *              standardized, a poorly implemented algorithm will have difficulty converging.
+  * @param xVariance the variance of the generated features.
+  * @param addIntercept whether to add an intercept term.
+  * @param nPoints the number of instances of generated data.
+  * @param seed the seed for the random generator; it is fixed for consistent test results.
+  */
+ def generateMultinomialLogisticInput(
+ weights: Array[Double],
+ xMean: Array[Double],
+ xVariance: Array[Double],
+ addIntercept: Boolean,
+ nPoints: Int,
+ seed: Int): Seq[LabeledPoint] = {
+ val rnd = new Random(seed)
+
+ val xDim = xMean.size
+ val xWithInterceptsDim = if (addIntercept) xDim + 1 else xDim
+ val nClasses = weights.size / xWithInterceptsDim + 1
+
+ val x = Array.fill[Vector](nPoints)(Vectors.dense(Array.fill[Double](xDim)(rnd.nextGaussian())))
+
+ x.map(vector => {
+ // This doesn't work if `vector` is a sparse vector.
+ val vectorArray = vector.toArray
+ var i = 0
+ while (i < vectorArray.size) {
+ vectorArray(i) = vectorArray(i) * math.sqrt(xVariance(i)) + xMean(i)
+ i += 1
+ }
+ })
+
+ val y = (0 until nPoints).map { idx =>
+ val xArray = x(idx).toArray
+ val margins = Array.ofDim[Double](nClasses)
+ val probs = Array.ofDim[Double](nClasses)
+
+ for (i <- 0 until nClasses - 1) {
+ for (j <- 0 until xDim) margins(i + 1) += weights(i * xWithInterceptsDim + j) * xArray(j)
+ if (addIntercept) margins(i + 1) += weights((i + 1) * xWithInterceptsDim - 1)
+ }
+ // Prevent overflow when we compute the probabilities.
+ val maxMargin = margins.max
+ if (maxMargin > 0) for (i <- 0 until nClasses) margins(i) -= maxMargin
+
+ // Computing the probabilities for each class from the margins.
+ val norm = {
+ var temp = 0.0
+ for (i <- 0 until nClasses) {
+ probs(i) = math.exp(margins(i))
+ temp += probs(i)
+ }
+ temp
+ }
+ for (i <- 0 until nClasses) probs(i) /= norm
+
+ // Compute the cumulative probabilities so we can generate a random number and assign a label.
+ for (i <- 1 until nClasses) probs(i) += probs(i - 1)
+ val p = rnd.nextDouble()
+ var y = 0
+ breakable {
+ for (i <- 0 until nClasses) {
+ if(p < probs(i)) {
+ y = i
--- End diff ---
add space
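For reference, the two numeric steps in the hunk above (the overflow-safe softmax over the class margins, and sampling a label from the cumulative distribution) can be sketched as standalone helpers. This is a minimal illustration, not the PR's code; `softmaxProbs` and `sampleLabel` are hypothetical names.

```scala
import scala.util.Random

object MultinomialSamplingSketch {
  // Given per-class margins x·w_i, compute softmax probabilities.
  // Subtracting the max margin leaves the distribution unchanged
  // but keeps exp() from overflowing, as in the reviewed code.
  def softmaxProbs(margins: Array[Double]): Array[Double] = {
    val maxMargin = margins.max
    val exps = margins.map(m => math.exp(m - maxMargin))
    val norm = exps.sum
    exps.map(_ / norm)
  }

  // Draw a uniform p in [0, 1) and walk the cumulative distribution
  // until it exceeds p; the index reached is the sampled class label.
  def sampleLabel(probs: Array[Double], rnd: Random): Int = {
    val p = rnd.nextDouble()
    var cumulative = 0.0
    var i = 0
    while (i < probs.length - 1 && cumulative + probs(i) <= p) {
      cumulative += probs(i)
      i += 1
    }
    i
  }
}
```

The max-margin subtraction is the standard trick for evaluating softmax with large margins: `exp(margin - maxMargin)` is at most 1, so the normalization sum stays finite.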