[GitHub] spark pull request: [SPARK-5563][mllib] online lda initial checkin

jkbradley Tue, 28 Apr 2015 15:46:34 -0700

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4419#discussion_r29296382
  
    --- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
    @@ -208,3 +225,224 @@ class EMLDAOptimizer extends LDAOptimizer{
         new DistributedLDAModel(this, iterationTimes)
       }
     }
    +
    +
    +/**
    + * :: Experimental ::
    + *
    + * An online optimizer for LDA. The Optimizer implements the Online LDA 
algorithm, which
    + * processes a subset of the corpus by each call to next, and update the 
term-topic
    + * distribution adaptively.
    + *
    + * References:
    + *   Hoffman, Blei and Bach, "Online Learning for Latent Dirichlet 
Allocation." NIPS, 2010.
    + */
    +@Experimental
    +class OnlineLDAOptimizer extends LDAOptimizer {
    +
    +  // LDA common parameters
    +  private var k: Int = 0
    +  private var D: Int = 0
    +  private var vocabSize: Int = 0
    +  private var alpha: Double = 0
    +  private var eta: Double = 0
    +  private var randomSeed: Long = 0
    +
    +  // Online LDA specific parameters
    +  private var tau_0: Double = -1
    +  private var kappa: Double = -1
    +  private var batchSize: Int = -1
    +
    +  // internal data structure
    +  private var docs: RDD[(Long, Vector)] = null
    +  private var lambda: BDM[Double] = null
    +  private var Elogbeta: BDM[Double]= null
    +  private var expElogbeta: BDM[Double] = null
    +
    +  // count of invocation to next, used to help deciding the weight for 
each iteration
    +  private var iteration = 0
    +
    +  /**
    +   * A (positive) learning parameter that downweights early iterations
    +   */
    +  def getTau_0: Double = {
    +    if (this.tau_0 == -1) {
    +      1024
    +    } else {
    +      this.tau_0
    +    }
    +  }
    +
    +  /**
    +   * A (positive) learning parameter that downweights early iterations
    +   * Automatic setting of parameter:
    +   *  - default = 1024, which follows the recommendation from OnlineLDA 
paper.
    +   */
    +  def setTau_0(tau_0: Double): this.type = {
    +    require(tau_0 > 0 || tau_0 == -1.0,  s"LDA tau_0 must be positive, but 
was set to $tau_0")
    +    this.tau_0 = tau_0
    +    this
    +  }
    +
    +  /**
    +   * Learning rate: exponential decay rate
    +   */
    +  def getKappa: Double = {
    +    if (this.kappa == -1) {
    +      0.5
    +    } else {
    +      this.kappa
    +    }
    +  }
    +
    +  /**
    +   * Learning rate: exponential decay rate---should be between
    +   * (0.5, 1.0] to guarantee asymptotic convergence.
    +   *  - default = 0.5, which follows the recommendation from OnlineLDA 
paper.
    --- End diff --
    
    This documentation is correct, but it may confuse users who see that 0.5 is 
not in the range stated.  How about stating the valid range, but recommending 
the range (0.5, 1.0]?  Also, let's not allow kappa = 0.
    We could also make the default be 0.51 so users don't ask questions.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request: [SPARK-5563][mllib] online lda initial checkin

Reply via email to