Github user squito commented on a diff in the pull request:
https://github.com/apache/spark/pull/6761#discussion_r32524098
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala ---
@@ -113,6 +106,55 @@ class NaiveBayesModel private[mllib] (
     }
   }
+
+  def predictProbabilities(testData: RDD[Vector]): RDD[Map[Double, Double]] = {
+    val bcModel = testData.context.broadcast(this)
+    testData.mapPartitions { iter =>
+      val model = bcModel.value
+      iter.map(model.predictProbabilities)
+    }
+  }
+
+  def predictProbabilities(testData: Vector): Map[Double, Double] = {
+    modelType match {
+      case Multinomial =>
+        val prob = multinomialCalculation(testData)
+        posteriorProbabilities(prob)
+      case Bernoulli =>
+        val prob = bernoulliCalculation(testData)
+        posteriorProbabilities(prob)
+      case _ =>
+        // This should never happen.
+        throw new UnknownError(s"Invalid modelType: $modelType.")
+    }
+  }
+
+  protected[classification] def multinomialCalculation(testData: Vector): DenseVector = {
+    val prob = thetaMatrix.multiply(testData)
+    BLAS.axpy(1.0, piVector, prob)
+    prob
+  }
+
+  protected[classification] def bernoulliCalculation(testData: Vector): DenseVector = {
+    testData.foreachActive { (index, value) =>
+      if (value != 0.0 && value != 1.0) {
+        throw new SparkException(
+          s"Bernoulli naive Bayes requires 0 or 1 feature values but found $testData.")
+      }
+    }
+    val prob = thetaMinusNegTheta.get.multiply(testData)
+    BLAS.axpy(1.0, piVector, prob)
+    BLAS.axpy(1.0, negThetaSum.get, prob)
+    prob
+  }
+
+  protected[classification] def posteriorProbabilities(prob: DenseVector): Map[Double, Double] = {
+    val maxLogs = max(prob.toBreeze)
+    val minLogs = min(prob.toBreeze)
+    val normalized = prob.toArray.map(e => (e - minLogs) / (maxLogs - minLogs))
--- End diff ---
Hi @acidghost , I'd just like to chime in here to back up what Sean is
saying -- maybe I can express it differently to see if it will help.
Your formula guarantees that the probabilities sum to 1. Also, it is
monotonic, so it keeps the _ordering_ of the probabilities correct (the
highest log probability is still the highest after your function), so it seems
like it's still making the right overall prediction. But as Sean has pointed
out, that doesn't mean the actual probability values are correct. You say that
Sean's formula gives you probabilities like `(0.98, 0.01, 0.01)` -- in fact,
that is quite common with NaiveBayes; it has a tendency to "overestimate" its
confidence. Small differences in log space are often big differences in
non-log space.
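To make that last point concrete, here's a small sketch of my own (not from
this PR): two log probabilities that differ by only 2.0 end up more than 7x
apart once you exponentiate.
```
// Two log-probabilities separated by just 2.0 in log space.
val logA = -1.0
val logB = -3.0

// In probability space that gap is a factor of e^2, roughly 7.4x.
println(math.exp(logA - logB))  // ~7.389

// After normalizing, the "close" log values give a lopsided distribution.
val (pa, pb) = (math.exp(logA), math.exp(logB))
println(pa / (pa + pb))  // ~0.881
println(pb / (pa + pb))  // ~0.119
```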
Try an example where you have some real probabilities that sum to 1, scale
them by a multiplicative factor, and then apply log. (A constant scale only
shifts every log by `log(scale)`, so in the transcript below I apply the log
to the unscaled `probs`; the argument is the same either way.) For example:
```
scala> val probs = Array(0.6,0.3,0.1)
probs: Array[Double] = Array(0.6, 0.3, 0.1)

scala> val scale = 0.1
scale: Double = 0.1

scala> val scaledProbs = probs.map{_ * scale}
scaledProbs: Array[Double] = Array(0.06, 0.03, 0.010000000000000002)

scala> val logScaledProbs = probs.map{math.log(_)}
logScaledProbs: Array[Double] = Array(-0.5108256237659907, -1.2039728043259361, -2.3025850929940455)
```
Our goal is to come up with the inverse of this and get back the original
`probs`. Let's try your function:
```
scala> val logScaledProbs = probs.map{math.log(_)}
logScaledProbs: Array[Double] = Array(-0.5108256237659907, -1.2039728043259361, -2.3025850929940455)

scala> val maxLog = logScaledProbs.max
maxLog: Double = -0.5108256237659907

scala> val probabilities = logScaledProbs.map(lp => math.exp(lp / math.abs(maxLog)))
probabilities: Array[Double] = Array(0.36787944117144233, 0.09471191684442327, 0.011025157721529972)

scala> val probSum = probabilities.sum
probSum: Double = 0.47361651573739555

scala> probabilities.map(_ / probSum)
res1: Array[Double] = Array(0.7767453814372874, 0.199975958813349, 0.023278659749363665)
```
Indeed, those final probabilities do sum to 1, and they are in a "reasonable"
range for probabilities -- but they still aren't the original probabilities we
started with.
If you want to do this, and have it work when the normalizing factor (aka
the `1 / scale` as I've written it above) is really big, then you can shift by
the max as Sean has suggested. I've written this up before for another
project; you can look at the implementation I have here:
https://github.com/squito/sblaj/blob/master/core/src/main/scala/org/sblaj/ArrayUtils.scala
and a unit test here:
https://github.com/squito/sblaj/blob/master/core/src/test/scala/org/sblaj/ArrayUtilsTest.scala#L15
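In case it's useful inline, here's a minimal sketch of that shift-by-max idea
(my own illustration with a made-up helper name, not the `ArrayUtils` code
verbatim):
```
// Turn log-scale values into normalized probabilities. Shifting by the max
// before exponentiating avoids underflow when the logs are very negative,
// and any constant offset (like log(scale)) cancels out in the division.
def logsToProbs(logProbs: Array[Double]): Array[Double] = {
  val maxLog = logProbs.max
  val unnormalized = logProbs.map(lp => math.exp(lp - maxLog))
  val total = unnormalized.sum
  unnormalized.map(_ / total)
}

// Applied to the example above, this recovers (0.6, 0.3, 0.1) up to
// floating-point error, even though the inputs were scaled by 0.1.
logsToProbs(Array(0.6, 0.3, 0.1).map(p => math.log(p * 0.1)))
```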