GitHub user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7705#discussion_r35726249
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
    @@ -463,15 +458,73 @@ final class OnlineLDAOptimizer extends LDAOptimizer {
         new BDM[Double](col, row, temp).t
       }
     
    +
    +  override private[clustering] def getLDAModel(iterationTimes: Array[Double]): LDAModel = {
    +    new LocalLDAModel(Matrices.fromBreeze(lambda).transpose, alpha, eta, gammaShape)
    +  }
    +
    +}
    +
    +/**
    + * Serializable companion object containing helper methods and shared code for
    + * [[OnlineLDAOptimizer]] and [[LocalLDAModel]].
    + */
    +object OnlineLDAOptimizer {
       /**
    -   * For theta ~ Dir(alpha), computes E[log(theta)] given alpha. Currently the implementation
    -   * uses digamma which is accurate but expensive.
    +   * Uses variational inference to infer the topic distribution `gammad` given the term counts
    +   * for a document.
    +   *
    +   * An optimization (Lee, Seung: Algorithms for non-negative matrix factorization, NIPS 2001)
    +   * avoids explicit computation of variational parameter `phi`.
    +   * @see http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.7566
        */
    -  private def dirichletExpectation(alpha: BDM[Double]): BDM[Double] = {
    -    val rowSum = sum(alpha(breeze.linalg.*, ::))
    -    val digAlpha = digamma(alpha)
    -    val digRowSum = digamma(rowSum)
    -    val result = digAlpha(::, breeze.linalg.*) - digRowSum
    -    result
    +  private[clustering] def variationalTopicInference(
    +      termCounts: Vector,
    +      expElogbeta: BDM[Double],
    +      alpha: breeze.linalg.Vector[Double],
    +      gammaShape: Double,
    +      k: Int): (BDV[Double], BDM[Double]) = {
    +    val (ids: List[Int], cts: Array[Double]) = termCounts match {
    +      case v: DenseVector => ((0 until v.size).toList, v.values)
    +      case v: SparseVector => (v.indices.toList, v.values)
    +      case v => throw new IllegalArgumentException("Online LDA does not support vector type "
    +        + v.getClass)
    +    }
    +    // Initialize the variational distribution q(theta|gamma) for the mini-batch
    +    val gammad: BDV[Double] =
    +      new Gamma(gammaShape, 1.0 / gammaShape).samplesVector(k)                   // K
    +    val expElogthetad: BDV[Double] = exp(LDAUtils.dirichletExpectation(gammad))  // K
    +    val expElogbetad = expElogbeta(ids, ::).toDenseMatrix                        // ids * K
    +
    +    val phinorm: BDV[Double] = expElogbetad * expElogthetad + 1e-100             // ids
    +    var meanchange = 1D
    +    val ctsVector = new BDV[Double](cts)                                         // ids
    +
    +    // Iterate between gamma and phi until convergence
    +    while (meanchange > 1e-3) {
    +      val lastgamma = gammad.copy
    +      //        K                  K * ids               ids
    +      gammad := (expElogthetad :* (expElogbetad.t * (ctsVector / phinorm))) + alpha
    --- End diff --
    
    I'd prefer not to change Breeze operators if there is no need, even if they 
are equivalent, since we could accidentally introduce errors.  Feel free to 
leave what you have; this is just a note for the future.
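
    One way to guard against that kind of accidental error is to compare the Breeze expression against a dependency-free reference. The following is only a rough sketch of what the commented line appears to compute, written with plain arrays; `GammaUpdateSketch` and `gammaUpdate` are hypothetical names that do not exist in Spark, and the sketch assumes `phinorm` is derived from the current `expElogthetad`, as in the initialization shown in the diff.

```scala
object GammaUpdateSketch {
  // Hypothetical helper (not part of Spark): one gamma update using plain arrays.
  // Shapes follow the comments in the diff: expElogbetad is ids x K.
  def gammaUpdate(
      expElogthetad: Array[Double],        // K
      expElogbetad: Array[Array[Double]],  // ids x K, indexed [term][topic]
      cts: Array[Double],                  // ids
      alpha: Array[Double]                 // K
  ): Array[Double] = {
    val k = expElogthetad.length
    // phinorm(i) = sum_j expElogbetad(i)(j) * expElogthetad(j) + 1e-100
    val phinorm = expElogbetad.map { row =>
      row.zip(expElogthetad).map { case (b, t) => b * t }.sum + 1e-100
    }
    // Element-wise division: ratio(i) = cts(i) / phinorm(i)
    val ratio = cts.zip(phinorm).map { case (c, p) => c / p }
    Array.tabulate(k) { j =>
      // Entry j of (expElogbetad.t * ratio)
      val weighted = expElogbetad.indices.map(i => expElogbetad(i)(j) * ratio(i)).sum
      // Element-wise product with expElogthetad, then add alpha
      expElogthetad(j) * weighted + alpha(j)
    }
  }
}
```

    A unit test could feed the same small inputs to both versions and check that the results agree within a tolerance, so that any future operator change is verified rather than assumed equivalent.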


