Github user squito commented on a diff in the pull request:
https://github.com/apache/spark/pull/6761#discussion_r32524098
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala ---
@@ -113,6 +106,55 @@ class NaiveBayesModel private[mllib] (
     }
   }
+
+  def predictProbabilities(testData: RDD[Vector]): RDD[Map[Double, Double]] = {
+    val bcModel = testData.context.broadcast(this)
+    testData.mapPartitions { iter =>
+      val model = bcModel.value
+      iter.map(model.predictProbabilities)
+    }
+  }
+
+  def predictProbabilities(testData: Vector): Map[Double, Double] = {
+    modelType match {
+      case Multinomial =>
+        val prob = multinomialCalculation(testData)
+        posteriorProbabilities(prob)
+      case Bernoulli =>
+        val prob = bernoulliCalculation(testData)
+        posteriorProbabilities(prob)
+      case _ =>
+        // This should never happen.
+        throw new UnknownError(s"Invalid modelType: $modelType.")
+    }
+  }
+
+  protected[classification] def multinomialCalculation(testData: Vector): DenseVector = {
+    val prob = thetaMatrix.multiply(testData)
+    BLAS.axpy(1.0, piVector, prob)
+    prob
+  }
+
+  protected[classification] def bernoulliCalculation(testData: Vector): DenseVector = {
+    testData.foreachActive { (index, value) =>
+      if (value != 0.0 && value != 1.0) {
+        throw new SparkException(
+          s"Bernoulli naive Bayes requires 0 or 1 feature values but found $testData.")
+      }
+    }
+    val prob = thetaMinusNegTheta.get.multiply(testData)
+    BLAS.axpy(1.0, piVector, prob)
+    BLAS.axpy(1.0, negThetaSum.get, prob)
+    prob
+  }
+
+  protected[classification] def posteriorProbabilities(prob: DenseVector): Map[Double, Double] = {
+    val maxLogs = max(prob.toBreeze)
+    val minLogs = min(prob.toBreeze)
+    val normalized = prob.toArray.map(e => (e - minLogs) / (maxLogs - minLogs))
--- End diff ---
Hi @acidghost , I'd just like to chime in here to back up what Sean is
saying -- maybe I can express it differently to see if it will help.
Your formula guarantees that the probabilities sum to 1. Also, it is
monotonic, so it keeps the _ordering_ of the probabilities correct (the
highest log probability is still the highest after your function), so it seems
like it's still making the right overall prediction. But as Sean has pointed
out, that doesn't mean the actual probability values are correct. You say that
Sean's formula gives you probabilities like `(0.98, 0.01, 0.01)` -- in fact,
that is quite common with NaiveBayes; it has a tendency to "overestimate" its
confidence. Small differences in log space are often big differences in
non-log space.
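To make that last point concrete, here's a small sketch of my own (not from
this PR): two log probabilities that differ by only 2.0 end up more than 7x
apart once you exponentiate.
```
// Two log-probabilities separated by just 2.0 in log space.
val logA = -1.0
val logB = -3.0

// In probability space that gap is a factor of e^2, roughly 7.4x.
println(math.exp(logA - logB))  // ~7.389

// After normalizing, the "close" log values give a lopsided distribution.
val (pa, pb) = (math.exp(logA), math.exp(logB))
println(pa / (pa + pb))  // ~0.881
println(pb / (pa + pb))  // ~0.119
```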
Try an example where you have some real probabilities that sum to 1, scale
them by a multiplicative factor, and then apply log. (A constant scale only
shifts every log by `log(scale)`, so in the transcript below I apply the log
to the unscaled `probs`; the argument is the same either way.) For example:
```
scala> val probs = Array(0.6,0.3,0.1)
probs: Array[Double] = Array(0.6, 0.3, 0.1)

scala> val scale = 0.1
scale: Double = 0.1

scala> val scaledProbs = probs.map{_ * scale}
scaledProbs: Array[Double] = Array(0.06, 0.03, 0.010000000000000002)

scala> val logScaledProbs = probs.map{math.log(_)}
logScaledProbs: Array[Double] = Array(-0.5108256237659907, -1.2039728043259361, -2.3025850929940455)
```
Our goal is to come up with the inverse of this and get back the original
`probs`. Let's try your function:
```
scala> val logScaledProbs = probs.map{math.log(_)}
logScaledProbs: Array[Double] = Array(-0.5108256237659907, -1.2039728043259361, -2.3025850929940455)

scala> val maxLog = logScaledProbs.max
maxLog: Double = -0.5108256237659907

scala> val probabilities = logScaledProbs.map(lp => math.exp(lp / math.abs(maxLog)))
probabilities: Array[Double] = Array(0.36787944117144233, 0.09471191684442327, 0.011025157721529972)

scala> val probSum = probabilities.sum
probSum: Double = 0.47361651573739555

scala> probabilities.map(_ / probSum)
res1: Array[Double] = Array(0.7767453814372874, 0.199975958813349, 0.023278659749363665)
```
Indeed, those final probabilities do sum to 1, and they are in a "reasonable"
range for probabilities -- but they still aren't the original probabilities we
started with.
If you want to do this, and have it work when the normalizing factor (aka
the `1 / scale` as I've written it above) is really big, then you can shift by
the max as Sean has suggested. I've written this up before for another
project; you can look at the implementation I have here:
https://github.com/squito/sblaj/blob/master/core/src/main/scala/org/sblaj/ArrayUtils.scala
and a unit test here:
https://github.com/squito/sblaj/blob/master/core/src/test/scala/org/sblaj/ArrayUtilsTest.scala#L15
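In case it's useful inline, here's a minimal sketch of that shift-by-max idea
(my own illustration with a made-up helper name, not the `ArrayUtils` code
verbatim):
```
// Turn log-scale values into normalized probabilities. Shifting by the max
// before exponentiating avoids underflow when the logs are very negative,
// and any constant offset (like log(scale)) cancels out in the division.
def logsToProbs(logProbs: Array[Double]): Array[Double] = {
  val maxLog = logProbs.max
  val unnormalized = logProbs.map(lp => math.exp(lp - maxLog))
  val total = unnormalized.sum
  unnormalized.map(_ / total)
}

// Applied to the example above, this recovers (0.6, 0.3, 0.1) up to
// floating-point error, even though the inputs were scaled by 0.1.
logsToProbs(Array(0.6, 0.3, 0.1).map(p => math.log(p * 0.1)))
```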