spark git commit: [SPARK-19110][ML][MLLIB] DistributedLDAModel returns different logPrior for original and loaded model

jkbradley Sat, 07 Jan 2017 11:10:32 -0800

Repository: spark
Updated Branches:
  refs/heads/branch-2.1 ee735a8a8 -> 86b66216d



[SPARK-19110][ML][MLLIB] DistributedLDAModel returns different logPrior for 
original and loaded model

## What changes were proposed in this pull request?

While adding the DistributedLDAModel training summary for SparkR, I found that 
logPrior differs between the original model and the loaded model.
For example, in test("read/write DistributedLDAModel"), I added this check:
val logPrior = model.asInstanceOf[DistributedLDAModel].logPrior
val logPrior2 = model2.asInstanceOf[DistributedLDAModel].logPrior
assert(logPrior === logPrior2)
The test fails:
-4.394180878889078 did not equal -4.294290536919573

The reason is that the `seqOp` passed to 
`graph.vertices.aggregate(0.0)(seqOp, _ + _)` discards its running sum, so each 
partition contributes only the value of the last vertex it visits instead of 
the sum over all of its vertices. Therefore, when the loaded model visits the 
vertices in a different order, it returns a different `logPrior`.
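
As a rough illustration of the behavior described above, here is a minimal 
sketch that does not use Spark at all. The `aggregate` helper is a hypothetical 
stand-in that only mimics the fold-per-partition-then-merge shape of 
`RDD.aggregate`; the object name, the partition layouts, and the numeric values 
are assumptions made up for the example.

```scala
object AggregateSketch {
  // Hypothetical stand-in for RDD.aggregate: fold every partition with seqOp,
  // starting from zero, then merge the per-partition results with combOp.
  def aggregate(partitions: Seq[Seq[Double]])(zero: Double)(
      seqOp: (Double, Double) => Double,
      combOp: (Double, Double) => Double): Double =
    partitions.map(_.foldLeft(zero)(seqOp)).foldLeft(zero)(combOp)

  // Mirrors the bug: the running sum (first argument) is ignored, so only the
  // last element of each partition survives the fold.
  val buggySeqOp: (Double, Double) => Double = (_, value) => value

  // Mirrors the fix: the running sum is carried forward.
  val fixedSeqOp: (Double, Double) => Double = (sumPrior, value) => sumPrior + value

  // Same values, visited in a different order within each partition
  // (made-up data standing in for per-vertex prior terms).
  val original: Seq[Seq[Double]] = Seq(Seq(1.0, 2.0), Seq(3.0, 4.0))
  val reloaded: Seq[Seq[Double]] = Seq(Seq(2.0, 1.0), Seq(4.0, 3.0))

  def main(args: Array[String]): Unit = {
    println(aggregate(original)(0.0)(buggySeqOp, _ + _)) // 6.0
    println(aggregate(reloaded)(0.0)(buggySeqOp, _ + _)) // 4.0
    println(aggregate(original)(0.0)(fixedSeqOp, _ + _)) // 10.0
    println(aggregate(reloaded)(0.0)(fixedSeqOp, _ + _)) // 10.0
  }
}
```

With the buggy `seqOp` the result depends on visiting order, while the fixed 
version gives the same sum for both layouts.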

Please refer to #16464 for details.

## How was this patch tested?

Added a new unit test that compares logPrior and trainingLogLikelihood between 
the original and the loaded model.

Author: [email protected] <[email protected]>

Closes #16491 from wangmiao1981/ldabug.

(cherry picked from commit 036b50347c56a3541c526b1270093163b9b79e45)
Signed-off-by: Joseph K. Bradley <[email protected]>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/86b66216
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/86b66216
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/86b66216

Branch: refs/heads/branch-2.1
Commit: 86b66216de411f8cbc79ede62b353f7cbb550903
Parents: ee735a8
Author: [email protected] <[email protected]>
Authored: Sat Jan 7 11:07:49 2017 -0800
Committer: Joseph K. Bradley <[email protected]>
Committed: Sat Jan 7 11:08:02 2017 -0800

----------------------------------------------------------------------
 .../scala/org/apache/spark/mllib/clustering/LDAModel.scala   | 4 ++--
 .../test/scala/org/apache/spark/ml/clustering/LDASuite.scala | 8 ++++++++
 2 files changed, 10 insertions(+), 2 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/86b66216/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala
----------------------------------------------------------------------
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala b/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala
index 25ffd85..b55f1b1 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala
@@ -745,12 +745,12 @@ class DistributedLDAModel private[clustering] (
           val N_wk = vertex._2
           val smoothed_N_wk: TopicCounts = N_wk + (eta - 1.0)
           val phi_wk: TopicCounts = smoothed_N_wk :/ smoothed_N_k
-          (eta - 1.0) * sum(phi_wk.map(math.log))
+          sumPrior + (eta - 1.0) * sum(phi_wk.map(math.log))
         } else {
           val N_kj = vertex._2
           val smoothed_N_kj: TopicCounts = N_kj + (alpha - 1.0)
           val theta_kj: TopicCounts = normalize(smoothed_N_kj, 1.0)
-          (alpha - 1.0) * sum(theta_kj.map(math.log))
+          sumPrior + (alpha - 1.0) * sum(theta_kj.map(math.log))
         }
     }
     graph.vertices.aggregate(0.0)(seqOp, _ + _)

http://git-wip-us.apache.org/repos/asf/spark/blob/86b66216/mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala
----------------------------------------------------------------------
diff --git a/mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala b/mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala
index 3f39ded..9aa11fb 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala
@@ -260,6 +260,14 @@ class LDASuite extends SparkFunSuite with MLlibTestSparkContext with DefaultRead
         Vectors.dense(model2.topicsMatrix.toArray) absTol 1e-6)
       assert(Vectors.dense(model.getDocConcentration) ~==
         Vectors.dense(model2.getDocConcentration) absTol 1e-6)
+      val logPrior = model.asInstanceOf[DistributedLDAModel].logPrior
+      val logPrior2 = model2.asInstanceOf[DistributedLDAModel].logPrior
+      val trainingLogLikelihood =
+        model.asInstanceOf[DistributedLDAModel].trainingLogLikelihood
+      val trainingLogLikelihood2 =
+        model2.asInstanceOf[DistributedLDAModel].trainingLogLikelihood
+      assert(logPrior ~== logPrior2 absTol 1e-6)
+      assert(trainingLogLikelihood ~== trainingLogLikelihood2 absTol 1e-6)
     }
     val lda = new LDA()
     testEstimatorAndModelReadWrite(lda, dataset,


