Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/4047#issuecomment-72133253
I just pushed a big update with the following changes:
Added checkpointing to LDA
* new class PeriodicGraphCheckpointer
* params checkpointDir, checkpointInterval to LDA
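For readers unfamiliar with the pattern, here is a rough sketch of the periodic-checkpointing idea in plain Python. It is not the PeriodicGraphCheckpointer code from this PR and does not use Spark. The class name, the `save`/`delete` callbacks, and the integer handles are all hypothetical. The assumption is that a checkpoint is written every `checkpoint_interval` updates, and that the previous checkpoint is dropped only once the newer one exists.

```python
class PeriodicCheckpointer:
    """Toy illustration of periodic checkpointing (hypothetical, not Spark's API)."""

    def __init__(self, checkpoint_interval, save, delete):
        if checkpoint_interval < 1:
            raise ValueError("checkpoint_interval must be >= 1")
        self.checkpoint_interval = checkpoint_interval
        self._save = save        # hypothetical callback: persist state, return a handle
        self._delete = delete    # hypothetical callback: remove an old checkpoint
        self._update_count = 0
        self._last_handle = None

    def update(self, state):
        """Register the state from one iteration; checkpoint on every N-th update."""
        self._update_count += 1
        if self._update_count % self.checkpoint_interval != 0:
            return False
        new_handle = self._save(state)
        # Drop the previous checkpoint only after the new one has been written.
        if self._last_handle is not None:
            self._delete(self._last_handle)
        self._last_handle = new_handle
        return True


# Usage: seven iterations with an interval of 3 checkpoint at iterations 3 and 6.
saved, deleted = [], []

def save(state):
    saved.append(state)
    return len(saved) - 1  # handle = index into `saved`

checkpointer = PeriodicCheckpointer(3, save, deleted.append)
flags = [checkpointer.update(i) for i in range(1, 8)]
```

In this sketch the checkpointer owns mutable bookkeeping (the update counter and the last handle), which is the kind of state that makes such an object awkward to copy.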
Internal changes to LDA
* Changed State to be mutable (since it needs to hold the
PeriodicGraphCheckpointer)
* Added timing instrumentation
* Changed DistributedLDAModel not to hold a LearningState
* This was needed since LearningState needs to hold a
PeriodicGraphCheckpointer. We should be able to copy a model, but we cannot
copy PeriodicGraphCheckpointer instances.
* Add checks for valid ranges of eta, alpha
* Rename "LearningState" to "EMOptimizer"
Public changes to LDA
* Updated naming of describeTopics, and commented out version using String
* Removed Document type in favor of (Long, Vector)
* Changed doc ID restriction to be: each ID must be nonnegative and unique
across documents (instead of 0,1,2,...)
* Renamed params: termSmoothing -> topicConcentration, topicSmoothing ->
docConcentration
* Also added aliases alpha, beta
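As an illustration of the relaxed ID rule listed above (nonnegative and unique, not necessarily 0,1,2,...), here is a minimal plain-Python sketch. It is not the check used in the PR. The function name is hypothetical, and Python tuples of `(int, list)` stand in for the `(Long, Vector)` pairs.

```python
def validate_doc_ids(docs):
    """Check (doc_id, term_counts) pairs; return the number of documents.

    Hypothetical helper: IDs must be nonnegative and unique, but they
    do not have to be contiguous or start at 0.
    """
    seen = set()
    for doc_id, _term_counts in docs:
        if doc_id < 0:
            raise ValueError(f"document ID must be nonnegative, got {doc_id}")
        if doc_id in seen:
            raise ValueError(f"duplicate document ID: {doc_id}")
        seen.add(doc_id)
    return len(seen)


# IDs 7, 42, 0 are acceptable under the new rule even though they are not 0,1,2.
corpus = [(7, [1.0, 0.0, 2.0]), (42, [0.0, 3.0, 1.0]), (0, [2.0, 2.0, 0.0])]
n_docs = validate_doc_ids(corpus)
```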
Also, this update includes code from
[https://github.com/apache/spark/pull/4253] which will be removed once that PR
gets merged.
These updates fix the issue with iterations taking longer and longer.
Items remaining:
* run large test again
* remove LDATiming
* possibly port vocab computation from LDATiming (if it's faster...to be
tested soon)