GitHub user EntilZha opened a pull request:
https://github.com/apache/spark/pull/4807
[SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor LDA for multiple LDA
algorithms (EM+Gibbs)
JIRA: https://issues.apache.org/jira/browse/SPARK-5556
As discussed in that issue, it would be great to have multiple LDA
algorithm options, principally EM (implemented already in #4047) and Gibbs.
## Goals of PR:
1. Refactor LDA to allow multiple algorithm options (done)
2. Refactor Gibbs code here to this interface (mostly done):
https://github.com/EntilZha/spark/tree/LDA-Refactor/mllib/src/main/scala/org/apache/spark/mllib/topicmodeling
3. Run the same performance tests run for the EM PR for comparison (todo,
initial smaller tests have been run)
At the moment, I am looking for feedback on the refactoring while working
on putting the Gibbs code in.
## Summary of Changes:
These traits encapsulate everything about an algorithm's implementation
while interfacing with the entry point ```LDA.run``` and
```DistributedLDAModel```.
```scala
private[clustering] trait LearningState {
  def next(): LearningState
  def topicsMatrix: Matrix
  def describeTopics(maxTermsPerTopic: Int): Array[(Array[Int], Array[Double])]
  def logLikelihood: Double
  def logPrior: Double
  def topicDistributions: RDD[(Long, Vector)]
  def globalTopicTotals: LDA.TopicCounts
  def k: Int
  def vocabSize: Int
  def docConcentration: Double
  def topicConcentration: Double
  def deleteAllCheckpoints(): Unit
}

trait LearningStateInitializer {
  def initialState(
      docs: RDD[(Long, Vector)],
      k: Int,
      docConcentration: Double,
      topicConcentration: Double,
      randomSeed: Long,
      checkpointInterval: Int): LearningState
}
```
An entire LDA implementation can be captured by an object and a class
which extend these traits. Specifically, the
```LearningStateInitializer``` provides the method that returns the initial
```LearningState```, which maintains the algorithm's state across iterations.
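To illustrate how the two traits divide the work, here is a minimal sketch of a driver loop. It is not the code in this PR: `SketchState`, `SketchInitializer`, `CountingState`, `CountingInitializer` and `SketchLDA` are hypothetical, Spark-free stand-ins, and the "algorithm" only counts iterations. The point is that the entry point needs nothing beyond `initialState` and `next()`.

```scala
// Hypothetical, simplified stand-ins for LearningState and
// LearningStateInitializer; the real traits operate on RDDs.
trait SketchState {
  def next(): SketchState
  def iteration: Int
}

trait SketchInitializer {
  def initialState(k: Int, randomSeed: Long): SketchState
}

// Toy "algorithm" (an assumption for illustration): each call to next()
// returns a new state with the iteration counter advanced by one.
final class CountingState(val k: Int, val iteration: Int) extends SketchState {
  def next(): SketchState = new CountingState(k, iteration + 1)
}

object CountingInitializer extends SketchInitializer {
  def initialState(k: Int, randomSeed: Long): SketchState =
    new CountingState(k, 0)
}

object SketchLDA {
  // The driver only knows about the two traits, not the concrete algorithm.
  def run(
      init: SketchInitializer,
      k: Int,
      maxIterations: Int,
      seed: Long): SketchState = {
    var state = init.initialState(k, seed)
    var i = 0
    while (i < maxIterations) {
      state = state.next()
      i += 1
    }
    state
  }
}
```

With this shape, swapping EM for Gibbs would only mean passing a different initializer to the driver.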
Lastly, the algorithm can be set via an enum, which is pattern matched to
create the corresponding implementation. My thought is that the default
algorithm should be whichever performs better.
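A rough sketch of the enum-based selection follows. All names are hypothetical (`SketchAlgorithm`, `SketchInit`, `EMInit`, `GibbsInit`, `AlgorithmSelector`), and it assumes a `scala.Enumeration`; the PR may well use a different enum encoding, such as a sealed trait.

```scala
// Hypothetical enum of available algorithms (assumes scala.Enumeration).
object SketchAlgorithm extends Enumeration {
  val EM, Gibbs = Value
}

// Hypothetical stand-in for LearningStateInitializer.
trait SketchInit {
  def name: String
}

object EMInit extends SketchInit {
  val name = "EM"
}

object GibbsInit extends SketchInit {
  val name = "Gibbs"
}

object AlgorithmSelector {
  // Pattern match on the enum value to pick the matching initializer.
  def initializerFor(algorithm: SketchAlgorithm.Value): SketchInit =
    algorithm match {
      case SketchAlgorithm.EM => EMInit
      case SketchAlgorithm.Gibbs => GibbsInit
    }
}
```

For example, `AlgorithmSelector.initializerFor(SketchAlgorithm.Gibbs)` returns `GibbsInit`. Changing the default algorithm would then be a one-line change wherever the default enum value is set.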
## Gibbs Implementation
Old design doc is here:
Primary Gibbs algorithm from here (mostly notation/math, GraphX based, not
table based): http://www.cs.ucsb.edu/~mingjia/cs240/doc/273811.pdf
Implements FastLDA from here:
http://www.ics.uci.edu/~newman/pubs/fastlda.pdf
### Specific Points for Feedback
1. Naming: it's hard, and I'm not sure the traits are named appropriately
2. Similarly, I am reasonably familiar with the Scala type system, but
perhaps there are some ninja tricks I don't know that would be helpful
3. General interface/cleanliness
4. Should the LearningStates, etc. go within LDA? I think so; thoughts?
5. Anything else; I'm also learning here.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/EntilZha/spark LDA-pull-request
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/4807.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #4807
----
commit 4bd110385f36927990a980085d727f12fc90fd53
Author: Pedro Rodriguez <[email protected]>
Date: 2015-02-26T07:39:47Z
updates to LDA and LDAModel to separate implementation from interface for
LDA algorithms
commit 34d58532b278f95aaab9c250f4a2ed64ea959811
Author: Pedro Rodriguez <[email protected]>
Date: 2015-02-27T05:35:51Z
refactored tests to make tests succeed
----