GitHub user jkbradley opened a pull request:
https://github.com/apache/spark/pull/4047
[SPARK-1405] [mllib] Latent Dirichlet Allocation (LDA) using EM
**This PR introduces an API + simple implementation for Latent Dirichlet
Allocation (LDA).**
The [design doc for this
PR](https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo)
has been updated since I initially posted it. In particular, see the API and
Planning for the Future sections.
## Goals
* Settle on a public API which may eventually include:
  * more inference algorithms
  * more options / functionality
* Have an initial easy-to-understand implementation which others may
improve.
* This is NOT intended to support every topic model out there. However, if
there are suggestions for making this extensible or pluggable in the future,
that could be nice, as long as it does not complicate the API or implementation
too much.
* This may not be very scalable currently. It will be important to check
and improve accuracy. For correctness of the implementation, please check
against the Asuncion et al. (2009) paper in the design doc.
## Sketch of contents of this PR
**Dependency: This makes MLlib depend on GraphX.**
Files and classes:
* LDA.scala (441 lines):
  * class LDA (main estimator class)
  * LDA.Document (text + document ID)
* LDAModel.scala (266 lines):
  * abstract class LDAModel
  * class LocalLDAModel
  * class DistributedLDAModel
* LDAExample.scala (245 lines): script to run LDA + a simple (private)
Tokenizer
* LDASuite.scala (144 lines)
Data/model representation and algorithm:
* Data/model: Uses GraphX, with term vertices + document vertices
* Algorithm: EM, following [Asuncion, Welling, Smyth, and Teh. "On
Smoothing and Inference for Topic Models." UAI,
2009.](http://arxiv-web3.library.cornell.edu/abs/1205.2662v1)
* For more details, please see the description in the "DEVELOPERS NOTE"
in LDA.scala
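To make the graph representation above concrete, here is a minimal sketch of one EM pass over a bipartite document/term graph. It is plain Python rather than the PR's Scala/GraphX code, and the names `init_counts` and `em_iteration` are hypothetical. The per-edge update is my reading of the MAP-style update in Asuncion et al. (2009) and should be checked against the paper; it assumes alpha > 1 and eta > 1 so that every weight stays positive.

```python
import random


def init_counts(edges, k, seed=0):
    """Randomly split each edge's token count across k topics (hypothetical init)."""
    rng = random.Random(seed)
    doc_counts, term_counts = {}, {}
    for d, w, n in edges:
        g = [rng.random() for _ in range(k)]
        s = sum(g)
        dv = doc_counts.setdefault(d, [0.0] * k)
        tv = term_counts.setdefault(w, [0.0] * k)
        for t in range(k):
            dv[t] += n * g[t] / s
            tv[t] += n * g[t] / s
    return doc_counts, term_counts


def em_iteration(edges, doc_counts, term_counts, vocab_size, alpha, eta):
    """One EM pass over (doc_id, term_id, token_count) edges.

    doc_counts[d][t]  ~ N_kj: tokens in document d assigned to topic t
    term_counts[w][t] ~ N_wk: tokens of term w assigned to topic t
    """
    k = len(next(iter(doc_counts.values())))
    # N_k: total tokens assigned to each topic, summed over term vertices.
    totals = [sum(v[t] for v in term_counts.values()) for t in range(k)]
    new_doc = {d: [0.0] * k for d in doc_counts}
    new_term = {w: [0.0] * k for w in term_counts}
    for d, w, n in edges:
        # E-step on this edge: unnormalized topic responsibilities.
        gamma = [
            (term_counts[w][t] + eta - 1.0)
            * (doc_counts[d][t] + alpha - 1.0)
            / (totals[t] + vocab_size * (eta - 1.0))
            for t in range(k)
        ]
        norm = sum(gamma)
        # M-step contribution: send the weighted count to both endpoints.
        for t in range(k):
            share = n * gamma[t] / norm
            new_doc[d][t] += share
            new_term[w][t] += share
    return new_doc, new_term
```

Each edge distributes exactly its token count across topics, so the per-document and per-term totals are preserved from one iteration to the next. In a graph engine, the loop over edges would correspond to a message-passing step and the accumulation to a per-vertex sum.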
## Design notes
Please refer to the JIRA for more discussion + the [design doc for this
PR](https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo).
Here, I list the main changes AFTER the design doc was posted.
Design decisions:
* logLikelihood() computes the log likelihood of the data and the current
point estimate of parameters. This is different from the likelihood of the
data given the hyperparameters, which would be harder to compute. I'd
describe the current approach as more frequentist, whereas the harder approach
would be more Bayesian.
* The current API takes Documents as token count vectors. I believe there
should be an extended API taking RDD[String] or RDD[Array[String]] in a future
PR. I have sketched this out in the design doc (as well as handier versions of
getTopics returning Strings).
* Hyperparameters should be set differently for different
inference/learning algorithms. See Asuncion et al. (2009) in the design doc
for a good demonstration. I encourage good behavior via defaults and warning
messages.
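As an illustration of the point-estimate log likelihood discussed in the first item above, the following sketch computes the data term log p(tokens | theta, phi) from fixed document-topic and topic-term estimates. The function name and data layout are hypothetical, and this shows only the data term; whether the PR's logLikelihood() also includes terms for the parameter estimates themselves should be checked in LDAModel.scala.

```python
import math


def log_likelihood(docs, theta, phi):
    """Log likelihood of token counts under point estimates.

    docs:  list of {term_id: count} dicts, one per document
    theta: theta[d][k] = estimated weight of topic k in document d
    phi:   phi[k][w]   = estimated probability of term w under topic k
    """
    num_topics = len(phi)
    total = 0.0
    for d, counts in enumerate(docs):
        for w, n in counts.items():
            # Probability of term w in document d, mixing over topics.
            p = sum(theta[d][k] * phi[k][w] for k in range(num_topics))
            total += n * math.log(p)
    return total
```

The marginal likelihood given only the hyperparameters would instead require integrating over theta and phi, which has no closed form for LDA and is why it is the harder quantity to compute.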
Items planned for future PRs:
* perplexity
* API taking Strings
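For context on what an API taking Strings would need to do, here is a rough sketch of turning tokenized documents into the token count vectors that the current API expects. The helper names and the sparse dict layout are assumptions for illustration, not the design-doc API.

```python
from collections import Counter


def build_vocab(tokenized_docs):
    """Assign each distinct token an integer term ID in first-seen order."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_count_vector(tokens, vocab):
    """Sparse token count vector {term_id: count}; unknown tokens are dropped."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return dict(sorted(counts.items()))
```

Keeping the vocabulary mapping around would also allow topics to be reported as strings rather than term IDs.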
## Questions for reviewers
* Should LDA be called LatentDirichletAllocation (and LDAModel be
LatentDirichletAllocationModel)?
  * Pro: We may someday want LinearDiscriminantAnalysis.
  * Con: Very long names
* Should LDA reside in clustering? Or do we want a sub-package?
  * mllib.topicmodel
  * mllib.clustering.topicmodel
* Does the API seem reasonable and extensible?
* Unit tests:
  * Should there be a test which checks clustering results? E.g., train
on a small, fake dataset with 2 very distinct topics/clusters, and ensure LDA
finds those 2 topics/clusters. Does that sound useful or too flaky?
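One way the test proposed in the last question could be made less flaky is sketched below: a fixed-seed synthetic corpus with two disjoint vocabularies, plus a check that does not depend on the order in which the model reports topics. All names and thresholds here are hypothetical, and the model-fitting step is deliberately left out.

```python
import random


def make_two_topic_corpus(n_docs=20, doc_len=50, vocab_per_topic=5, seed=42):
    """Documents alternate between two disjoint vocabularies (fixed seed)."""
    rng = random.Random(seed)
    vocab_a = list(range(vocab_per_topic))
    vocab_b = list(range(vocab_per_topic, 2 * vocab_per_topic))
    docs = []
    for i in range(n_docs):
        vocab = vocab_a if i % 2 == 0 else vocab_b
        counts = {}
        for _ in range(doc_len):
            w = rng.choice(vocab)
            counts[w] = counts.get(w, 0) + 1
        docs.append(counts)
    return docs, vocab_a, vocab_b


def topics_match_clusters(topics, vocab_a, vocab_b, min_mass=0.9):
    """True if the two learned topics each concentrate on one vocabulary.

    topics: two lists of per-term weights, each summing to about 1.
    The check accepts either topic ordering, since topic labels are arbitrary.
    """
    def mass(topic, vocab):
        return sum(topic[w] for w in vocab)

    t0, t1 = topics
    straight = mass(t0, vocab_a) >= min_mass and mass(t1, vocab_b) >= min_mass
    swapped = mass(t0, vocab_b) >= min_mass and mass(t1, vocab_a) >= min_mass
    return straight or swapped
```

A unit test would fit a 2-topic model on the generated corpus with a fixed seed and assert `topics_match_clusters` on the resulting topic-term weights. Using a loose mass threshold rather than exact values leaves room for differences between inference algorithms.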
## Other notes
This has not been tested much for scaling. I have run it on a laptop for
200 iterations on a 5MB dataset with 1000 terms and 5 topics. Running it for
500 iterations made it fail because of GC problems. Future PRs will need to
improve the scaling.
## Thanks to…
* @dlwh for the initial implementation
  * + @jegonzal for some code in the initial implementation
* The many contributors towards topic model implementations in Spark which
were referenced as a basis for this PR: @akopich @witgo @yinxusen @dlwh
@EntilZha @jegonzal @IlyaKozlov
CC: @mengxr
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jkbradley/spark davidhall-lda
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/4047.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #4047
----
commit 186eba2736679cdb4072d37fcad296647c2ec1e2
Author: Joseph K. Bradley <[email protected]>
Date: 2014-12-16T23:58:36Z
Added 3 files from dlwh LDA implementation
commit 087d81d73b9c98e2e087005c896d184fe95b7431
Author: Joseph K. Bradley <[email protected]>
Date: 2015-01-12T20:34:32Z
Prepped LDA main class for PR, but some cleanups remain
commit 724e2cff12671ed21ac7d719570732b5a7eca96a
Author: Joseph K. Bradley <[email protected]>
Date: 2015-01-13T19:32:12Z
cleanups before PR
commit 10bf4d6b2f10b2bd7bda1ec9eb270ee60ad9a6b8
Author: Joseph K. Bradley <[email protected]>
Date: 2015-01-14T00:45:06Z
separated LDA models into own file. more cleanups before PR
commit c6e430867ca32ca6f409f953a2d47dd04a1e6e53
Author: Joseph K. Bradley <[email protected]>
Date: 2015-01-14T18:17:20Z
Unit tests and fixes for LDA, now ready for PR
----