GitHub user dbtsai commented on a diff in the pull request:
https://github.com/apache/spark/pull/3833#discussion_r23823903
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/classification/LogisticRegressionSuite.scala ---
@@ -55,6 +56,97 @@ object LogisticRegressionSuite {
val testData = (0 until nPoints).map(i => LabeledPoint(y(i),
Vectors.dense(Array(x1(i)))))
testData
}
+
+ /**
+  * Generates `k` classes multinomial synthetic logistic input in `n` dimensional space given the
+  * model weights and mean/variance of the features. The synthetic data will be drawn from
+  * the probability distribution constructed by weights using the following formula:
+  *
+  *   P(y = 0 | x) = 1 / norm
+  *   P(y = 1 | x) = exp(x * w_1) / norm
+  *   P(y = 2 | x) = exp(x * w_2) / norm
+  *   ...
+  *   P(y = k-1 | x) = exp(x * w_{k-1}) / norm
+  *
+  * where norm = 1 + exp(x * w_1) + exp(x * w_2) + ... + exp(x * w_{k-1}).
+  *
+  * @param weights matrix flattened into a vector; as a result, the dimension of the weights
+  *                vector will be (k - 1) * (n + 1) if `addIntercept == true`, and
+  *                (k - 1) * n if `addIntercept != true`.
+  * @param xMean the mean of the generated features. Often, if the features are not properly
+  *              standardized, a poorly implemented algorithm will have difficulty converging.
+  * @param xVariance the variance of the generated features.
+  * @param addIntercept whether to add an intercept term.
+  * @param nPoints the number of instances of generated data.
+  * @param seed the seed for the random generator; it is fixed for consistent test results.
+  */
+ def generateMultinomialLogisticInput(
+ weights: Array[Double],
+ xMean: Array[Double],
+ xVariance: Array[Double],
+ addIntercept: Boolean,
+ nPoints: Int,
+ seed: Int): Seq[LabeledPoint] = {
+ val rnd = new Random(seed)
+
+ val xDim = xMean.size
+ val xWithInterceptsDim = if (addIntercept) xDim + 1 else xDim
+ val nClasses = weights.size / xWithInterceptsDim + 1
+
+ val x = Array.fill[Vector](nPoints)(Vectors.dense(Array.fill[Double](xDim)(rnd.nextGaussian())))
+
+ x.map(vector => {
+ // This doesn't work if `vector` is a sparse vector.
+ val vectorArray = vector.toArray
+ var i = 0
+ while (i < vectorArray.size) {
+ vectorArray(i) = vectorArray(i) * math.sqrt(xVariance(i)) + xMean(i)
+ i += 1
+ }
+ })
+
+ val y = (0 until nPoints).map { idx =>
+ val xArray = x(idx).toArray
+ val margins = Array.ofDim[Double](nClasses)
+ val probs = Array.ofDim[Double](nClasses)
+
+ for (i <- 0 until nClasses - 1) {
+ for (j <- 0 until xDim) margins(i + 1) += weights(i * xWithInterceptsDim + j) * xArray(j)
+ if (addIntercept) margins(i + 1) += weights((i + 1) * xWithInterceptsDim - 1)
+ }
+ // Prevent overflow when we compute the probabilities.
+ val maxMargin = margins.max
+ if (maxMargin > 0) for (i <- 0 until nClasses) margins(i) -= maxMargin
+
+ // Computing the probabilities for each class from the margins.
+ val norm = {
+ var temp = 0.0
+ for (i <- 0 until nClasses) {
+ probs(i) = math.exp(margins(i))
+ temp += probs(i)
+ }
+ temp
+ }
+ for (i <- 0 until nClasses) probs(i) /= norm
+
+ // Compute the cumulative probabilities so we can generate a random number and assign a label.
+ for (i <- 1 until nClasses) probs(i) += probs(i - 1)
+ val p = rnd.nextDouble()
+ var y = 0
+ breakable {
+ for (i <- 0 until nClasses) {
+ if(p < probs(i)) {
+ y = i
--- End diff ---
add space
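For reference, the two numeric steps in the hunk above (the overflow-safe softmax over the class margins, and sampling a label from the cumulative distribution) can be sketched as standalone helpers. This is a minimal illustration, not the PR's code; `softmaxProbs` and `sampleLabel` are hypothetical names.

```scala
import scala.util.Random

object MultinomialSamplingSketch {
  // Given per-class margins x·w_i, compute softmax probabilities.
  // Subtracting the max margin leaves the distribution unchanged
  // but keeps exp() from overflowing, as in the reviewed code.
  def softmaxProbs(margins: Array[Double]): Array[Double] = {
    val maxMargin = margins.max
    val exps = margins.map(m => math.exp(m - maxMargin))
    val norm = exps.sum
    exps.map(_ / norm)
  }

  // Draw a uniform p in [0, 1) and walk the cumulative distribution
  // until it exceeds p; the index reached is the sampled class label.
  def sampleLabel(probs: Array[Double], rnd: Random): Int = {
    val p = rnd.nextDouble()
    var cumulative = 0.0
    var i = 0
    while (i < probs.length - 1 && cumulative + probs(i) <= p) {
      cumulative += probs(i)
      i += 1
    }
    i
  }
}
```

The max-margin subtraction is the standard trick for evaluating softmax with large margins: `exp(margin - maxMargin)` is at most 1, so the normalization sum stays finite.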