GitHub user hhbyyh opened a pull request:
https://github.com/apache/spark/pull/5661
[Spark-7090][MLlib] Introduce LDAOptimizer to LDA to further improve
extensibility
jira: https://issues.apache.org/jira/browse/SPARK-7090
LDA was implemented with extensibility in mind. With OnlineLDA and Gibbs
Sampling now in development, we are collecting more detailed requirements from
the different algorithms.
As Joseph Bradley proposed in https://github.com/apache/spark/pull/4807 and
with some further discussion, we'd like to adjust the code structure a little
to present the common interface and extension point clearly.
Basically, class LDA would be the common entry point for LDA computation. Each
LDA object refers to an LDAOptimizer for the concrete algorithm
implementation. Users can customize an LDAOptimizer with specific parameters
and assign it to LDA.
Concrete changes:
1. Add a trait `LDAOptimizer`, which defines the common interface for
concrete implementations. Each subclass is a wrapper for a specific LDA
algorithm.
2. Move EMOptimizer to the file LDAOptimizer, make it inherit from
LDAOptimizer, and rename it to EMLDAOptimizer (in case a more generic
EMOptimizer comes in the future).
-adjust the constructor of EMOptimizer, since all the parameters
should be passed in through the initialState method. This avoids unwanted
confusion or overwrites.
-move the code from LDA.initialState to initialState of EMLDAOptimizer
3. Add property ldaOptimizer to LDA with its getter/setter;
EMLDAOptimizer is the default optimizer.
4. Change the return type of LDA.run from DistributedLDAModel to LDAModel.
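A rough sketch of the structure described above, for discussion only. It is
not the code in this PR: a plain Seq stands in for Spark's RDD, and the names
Corpus, next, getLDAModel, setOptimizer, setK and maxIterations are
assumptions made for illustration. Only LDAOptimizer, EMLDAOptimizer,
initialState, ldaOptimizer, LDAModel and DistributedLDAModel come from the
description.

```scala
// Illustrative stand-ins only; none of this is Spark's actual API.
object LdaSketch {
  // Hypothetical corpus type: (document id, term counts). Real MLlib uses an RDD.
  type Corpus = Seq[(Long, Array[Double])]

  trait LDAModel { def k: Int }

  // Stand-in for the model produced by the EM algorithm.
  final case class DistributedLDAModel(k: Int, iterations: Int) extends LDAModel

  // Change 1: the common interface, one implementation per LDA algorithm.
  trait LDAOptimizer {
    // All parameters arrive here rather than through the constructor.
    def initialState(docs: Corpus, k: Int, seed: Long): LDAOptimizer
    def next(): LDAOptimizer      // hypothetical: run one iteration
    def getLDAModel: LDAModel     // hypothetical: build the resulting model
  }

  // Change 2: the EM implementation, renamed and moved behind the trait.
  class EMLDAOptimizer extends LDAOptimizer {
    private var k: Int = 0
    private var steps: Int = 0

    override def initialState(docs: Corpus, k: Int, seed: Long): LDAOptimizer = {
      this.k = k
      this.steps = 0
      this
    }

    // A real implementation would update the topic distributions here.
    override def next(): LDAOptimizer = { steps += 1; this }

    override def getLDAModel: LDAModel = DistributedLDAModel(k, steps)
  }

  // Change 3: LDA holds an optimizer, defaulting to EM.
  class LDA(private var k: Int = 10,
            private var maxIterations: Int = 20,
            private var seed: Long = 0L) {
    private var ldaOptimizer: LDAOptimizer = new EMLDAOptimizer

    def getOptimizer: LDAOptimizer = ldaOptimizer
    def setOptimizer(optimizer: LDAOptimizer): this.type = {
      ldaOptimizer = optimizer
      this
    }
    def setK(k: Int): this.type = { this.k = k; this }

    // Change 4: run returns the general LDAModel, not DistributedLDAModel.
    def run(docs: Corpus): LDAModel = {
      val state = ldaOptimizer.initialState(docs, k, seed)
      var iter = 0
      while (iter < maxIterations) { state.next(); iter += 1 }
      state.getLDAModel
    }
  }
}
```

Under these assumptions, a future optimizer such as OnlineLDAOptimizer would
only need to implement the trait and be passed to setOptimizer; LDA.run would
stay unchanged.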
Further work:
add OnlineLDAOptimizer and other possible Optimizers once ready.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/hhbyyh/spark ldaRefactor
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/5661.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #5661
----
commit ec2f857645bdcabc8f51c310237d0365e7d2230e
Author: Yuhao Yang <[email protected]>
Date: 2015-04-22T12:49:37Z
protoptype for discussion
commit 0bb8400e70011c8f97ece31d395a8c75b15bab4f
Author: Yuhao Yang <[email protected]>
Date: 2015-04-23T11:15:04Z
refactor LDA with Optimizer
----