[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...

jkbradley Thu, 29 Jan 2015 16:35:30 -0800

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/4047#issuecomment-72133253
  
    I just pushed a big update with the following changes:
    
    Added checkpointing to LDA
    * new class PeriodicGraphCheckpointer
    * params checkpointDir, checkpointInterval to LDA
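
    To illustrate the idea behind periodic checkpointing, here is a minimal
    sketch. It is not the PeriodicGraphCheckpointer from this PR: the class
    PeriodicCheckpointer, the function run_iterations, and the pickle-based
    storage are all hypothetical stand-ins, and the assumption that older
    checkpoints get deleted once a newer one is written is mine. Only the two
    parameter names (checkpointDir, checkpointInterval) come from the list above.

```python
import os
import pickle
import tempfile


class PeriodicCheckpointer:
    """Hypothetical helper: persists state every `checkpoint_interval` updates."""

    def __init__(self, checkpoint_dir, checkpoint_interval):
        if checkpoint_interval < 1:
            raise ValueError("checkpoint_interval must be >= 1")
        self.checkpoint_dir = checkpoint_dir
        self.checkpoint_interval = checkpoint_interval
        self.update_count = 0
        self.checkpoint_files = []  # paths of live checkpoints, oldest first

    def update(self, state):
        """Register the state produced by one iteration; checkpoint if due."""
        self.update_count += 1
        if self.update_count % self.checkpoint_interval != 0:
            return None
        path = os.path.join(self.checkpoint_dir, f"state-{self.update_count}.pkl")
        with open(path, "wb") as f:
            pickle.dump(state, f)
        self.checkpoint_files.append(path)
        # Assumption: once the new checkpoint is written, older ones are redundant.
        while len(self.checkpoint_files) > 1:
            os.remove(self.checkpoint_files.pop(0))
        return path


def run_iterations(num_iterations, checkpoint_interval):
    """Toy loop: the 'state' is a growing list standing in for the graph."""
    with tempfile.TemporaryDirectory() as checkpoint_dir:
        checkpointer = PeriodicCheckpointer(checkpoint_dir, checkpoint_interval)
        state = []
        written = []
        for i in range(num_iterations):
            state = state + [i]  # each iteration derives a new state
            path = checkpointer.update(state)
            if path is not None:
                written.append(os.path.basename(path))
        remaining = sorted(os.listdir(checkpoint_dir))
    return written, remaining
```

    With run_iterations(10, 3), checkpoints are written after updates 3, 6 and
    9, and only the most recent file is left on disk at the end.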
    
    Internal changes to LDA
    * Changed State to be mutable (since it needs to hold the 
PeriodicGraphCheckpointer)
    * Added timing instrumentation
    * Changed DistributedLDAModel not to hold a LearningState
      * This was needed since LearningState needs to hold a 
PeriodicGraphCheckpointer.  We should be able to copy a model, but we cannot 
copy PeriodicGraphCheckpointer instances.
    * Added checks for valid ranges of eta, alpha
    * Renamed "LearningState" to "EMOptimizer"
    
    Public changes to LDA
    * Updated naming of describeTopics, and commented out version using String
    * Removed Document type in favor of (Long, Vector)
    * Changed doc ID restriction: each ID must be nonnegative and unique 
(IDs no longer have to be 0,1,2,...)
    * Renamed params: termSmoothing -> topicConcentration, topicSmoothing -> 
docConcentration
      * Also added aliases alpha, beta
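
    The relaxed doc ID restriction can be sketched as a simple validation pass.
    This is an illustration only, not code from the PR: validate_doc_ids is a
    hypothetical name, plain Python lists stand in for the Vector half of each
    (Long, Vector) pair, and uniqueness is assumed to mean that no two
    documents in the input share an ID.

```python
def validate_doc_ids(documents):
    """Check (doc_id, term_counts) pairs; return the number of distinct IDs.

    Hypothetical helper: IDs must be nonnegative and must not repeat,
    but they do not have to be consecutive or start at 0.
    """
    seen = set()
    for doc_id, _term_counts in documents:
        if doc_id < 0:
            raise ValueError(f"document ID must be nonnegative, got {doc_id}")
        if doc_id in seen:
            raise ValueError(f"duplicate document ID: {doc_id}")
        seen.add(doc_id)
    return len(seen)
```

    For example, validate_doc_ids([(7, [1.0, 0.0]), (42, [0.0, 2.0])]) accepts
    the non-consecutive IDs 7 and 42, while a negative or repeated ID raises
    ValueError.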
    
    Also, this update includes code from 
[https://github.com/apache/spark/pull/4253] which will be removed once that PR 
gets merged.
    
    These updates fix the issue with iterations taking longer and longer.
    
    Items remaining:
    * run large test again
    * remove LDATiming
    * possibly port vocab computation from LDATiming (if it's faster...to be 
tested soon)



