GitHub user yinxusen opened a pull request:
https://github.com/apache/spark/pull/476
JIRA issue: [SPARK-1405](https://issues.apache.org/jira/browse/SPARK-1405)
Gibbs sampling based Latent Dirichlet Allocation (LDA) for MLlib
(This PR is based on joint work done with @liancheng four months ago.)
## Overview
LDA is a classical topic model in machine learning that extracts topics
from a corpus. Gibbs sampling (GS for short) is a common way to train an
LDA model.
The LDA model consists of four count matrices. Two are 1-dimensional:
* Document counts
* Topic counts
and two are 2-dimensional:
* Document-Topic counts
* Topic-Term counts
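To make the role of the four matrices concrete, here is a minimal single-machine sketch of one collapsed Gibbs sampling sweep in plain Python. It is not the Scala code in this PR. The function names, the `alpha`/`beta` smoothing parameters, and the reading of "document counts" as tokens per document and "topic counts" as tokens per topic are assumptions made for illustration.

```python
import random


def build_counts(docs, assignments, num_topics, vocab_size):
    """Build the four count structures from per-token topic assignments.

    docs[d] is a list of term IDs; assignments[d][i] is the topic of token i.
    """
    doc_counts = [len(doc) for doc in docs]            # tokens per document
    topic_counts = [0] * num_topics                    # tokens per topic
    doc_topic = [[0] * num_topics for _ in docs]       # document-topic counts
    topic_term = [[0] * vocab_size for _ in range(num_topics)]  # topic-term counts
    for d, (doc, zs) in enumerate(zip(docs, assignments)):
        for w, z in zip(doc, zs):
            topic_counts[z] += 1
            doc_topic[d][z] += 1
            topic_term[z][w] += 1
    return doc_counts, topic_counts, doc_topic, topic_term


def resample(d, w, old_z, counts, alpha, beta, rng):
    """Resample the topic of one token (term w in document d)."""
    _, topic_counts, doc_topic, topic_term = counts
    vocab_size = len(topic_term[0])

    # Remove the token's current assignment from the counts.
    topic_counts[old_z] -= 1
    doc_topic[d][old_z] -= 1
    topic_term[old_z][w] -= 1

    # Unnormalized conditional probability of each topic for this token.
    weights = [
        (doc_topic[d][k] + alpha)
        * (topic_term[k][w] + beta)
        / (topic_counts[k] + vocab_size * beta)
        for k in range(len(topic_counts))
    ]
    new_z = rng.choices(range(len(weights)), weights=weights)[0]

    # Add the token back under its new topic.
    topic_counts[new_z] += 1
    doc_topic[d][new_z] += 1
    topic_term[new_z][w] += 1
    return new_z


def gibbs_sweep(docs, assignments, counts, alpha, beta, rng):
    """Run one full pass over every token in the corpus, updating in place."""
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            assignments[d][i] = resample(d, w, assignments[d][i], counts, alpha, beta, rng)
```

Each resampling step touches only one row of the document-topic counts, one column of the topic-term counts, and the topic counts, so the four matrices stay consistent with the assignments after every sweep.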
## Implementation details
* An accumulator aggregates all updated values, which are then applied to
the old model computed in the previous iteration.
* [Chalk](https://github.com/scalanlp/chalk) is used for term segmentation.
Although it would be easy to rewrite this with Lucene analyzers, I think MLlib
should not take on the burden of maintaining a tokenizer implementation.
* `SparkContext.wholeTextFiles()` is convenient for offline
experimentation, while `SparkContext.textFile()` is better for online
applications.
* The document dictionary and term dictionary are broadcast to translate
document names and terms into `Int` IDs.
* The topic assignment matrix from the previous iteration is cached for the
current iteration, and then unpersisted to release memory.
* LDA suffers from a stack overflow problem similar to that of MLlib ALS
([SPARK-1006](https://spark-project.atlassian.net/browse/SPARK-1006)). To
work around this issue, we checkpoint every few iterations.
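
A few of the points above can be sketched in plain Python. This is a rough single-process stand-in, not the Spark code in this PR: the dictionaries are ordinary `dict`s rather than broadcast variables, the per-partition deltas are merged with a `Counter` rather than a Spark accumulator, and the helper names and the `interval` parameter are hypothetical.

```python
from collections import Counter


def build_dictionary(items):
    """Assign each distinct item a dense integer ID in first-seen order."""
    ids = {}
    for item in items:
        # setdefault only inserts when the item has not been seen before.
        ids.setdefault(item, len(ids))
    return ids


def merge_deltas(model, partition_deltas):
    """Sum the count deltas from all partitions and apply them to the old model.

    model and each delta map a (row, column) key to a count; deltas may be
    negative when a token leaves a topic.
    """
    total = Counter()
    for delta in partition_deltas:
        total.update(delta)  # Counter.update adds counts, including negatives
    new_model = Counter(model)
    new_model.update(total)
    return new_model


def should_checkpoint(iteration, interval):
    """Return True on every interval-th iteration (1-based)."""
    return interval > 0 and iteration % interval == 0
```

Merging deltas instead of whole models keeps the per-iteration update small, and checkpointing on a fixed interval truncates the lineage before it grows deep enough to overflow the stack.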
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/yinxusen/spark lda
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/476.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #476
----
commit 1f8793af562163e251d593c0f5118dea9176d7b5
Author: Xusen Yin <[email protected]>
Date: 2014-04-22T01:45:52Z
initial commit
commit e137287f5cd73290fa558c07ededbeb091eda215
Author: Xusen Yin <[email protected]>
Date: 2014-04-22T01:48:52Z
fix import style
commit 7378cff8eae01a726a1ea53b21db5f6972d6f14e
Author: Xusen Yin <[email protected]>
Date: 2014-04-22T01:52:30Z
ready for PR
commit 063ff0fa607e956923dda32ebfcb4629583867d5
Author: Cheng Lian <[email protected]>
Date: 2014-04-22T02:32:36Z
Code cleanup
commit 45b157edfa4e6444809daca9b0b2d57e2b575e4b
Author: Xusen Yin <[email protected]>
Date: 2014-04-22T05:06:56Z
fix minor error
----