GitHub user yinxusen opened a pull request:

    https://github.com/apache/spark/pull/476

    JIRA issue: [SPARK-1405](https://issues.apache.org/jira/browse/SPARK-1405) 
Gibbs sampling based Latent Dirichlet Allocation (LDA) for MLlib

    (This PR is based on joint work done with @liancheng four months ago.)
    
    ## Overview
    
    LDA is a classical topic model in machine learning that extracts topics 
from a corpus. Gibbs sampling (GS for short) is a common way to fit an LDA 
model.
    
    The LDA model consists of four count matrices. Two are 1-dimensional:
    
    * Document counts
    * Topic counts
    
    and two are 2-dimensional:
    
    * Document-Topic counts
    * Topic-Term counts
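    
    As a rough illustration of how these four count structures interact 
during collapsed Gibbs sampling, here is a minimal single-machine sketch in 
plain Python. It is not the code in this PR: the function names, the 
hyperparameters `alpha` and `beta`, and the reading of "document counts" as 
the number of tokens per document are all assumptions made for the example.

```python
import random


def init_counts(docs, num_topics, vocab_size, rng):
    """Assign a random topic to every token and build the four count arrays."""
    doc_counts = [len(doc) for doc in docs]        # 1-dim: tokens per document (assumed meaning)
    topic_counts = [0] * num_topics                # 1-dim: tokens per topic
    doc_topic = [[0] * num_topics for _ in docs]   # 2-dim: document x topic
    topic_term = [[0] * vocab_size for _ in range(num_topics)]  # 2-dim: topic x term
    assignments = []
    for d, doc in enumerate(docs):
        z_d = []
        for term in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            topic_counts[k] += 1
            doc_topic[d][k] += 1
            topic_term[k][term] += 1
        assignments.append(z_d)
    return doc_counts, topic_counts, doc_topic, topic_term, assignments


def gibbs_sweep(docs, counts, alpha, beta, rng):
    """Resample the topic of every token once, updating the counts in place."""
    _, topic_counts, doc_topic, topic_term, assignments = counts
    num_topics = len(topic_counts)
    vocab_size = len(topic_term[0])
    for d, doc in enumerate(docs):
        for i, term in enumerate(doc):
            old = assignments[d][i]
            # Remove the token's current assignment from the counts.
            topic_counts[old] -= 1
            doc_topic[d][old] -= 1
            topic_term[old][term] -= 1
            # Unnormalized conditional for each topic. The per-document
            # denominator is the same for every topic, so it is omitted.
            weights = [
                (doc_topic[d][k] + alpha)
                * (topic_term[k][term] + beta)
                / (topic_counts[k] + vocab_size * beta)
                for k in range(num_topics)
            ]
            new = rng.choices(range(num_topics), weights=weights)[0]
            # Record the new assignment.
            assignments[d][i] = new
            topic_counts[new] += 1
            doc_topic[d][new] += 1
            topic_term[new][term] += 1


# Toy corpus: each document is a list of term IDs in a 5-term vocabulary.
docs = [[0, 1, 2, 1], [2, 3, 3], [0, 0, 4, 1, 2]]
rng = random.Random(0)
counts = init_counts(docs, num_topics=2, vocab_size=5, rng=rng)
for _ in range(10):
    gibbs_sweep(docs, counts, alpha=0.1, beta=0.01, rng=rng)
```

    Each sweep keeps the structures consistent with one another: every row of 
the document-topic matrix sums to that document's count, and every row of the 
topic-term matrix sums to that topic's count.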
    
    ## Implementation details
    
    * An accumulator is used to aggregate all updated values, which are then 
applied to the model computed in the previous iteration.
    
    * [Chalk](https://github.com/scalanlp/chalk) is used for term segmentation. 
Although it would be easy to rewrite this with Lucene analyzers, I think MLlib 
should not take on the burden of maintaining its own tokenizer implementation.
    
    * `SparkContext.wholeTextFiles()` is convenient for offline 
experimentation, while `SparkContext.textFile()` is better for online 
applications.
    
    * The document dictionary and the term dictionary are broadcast to 
translate document names and terms into `Int` IDs.
    
    * The topic assignment matrix from the previous iteration is cached for 
the current iteration and then unpersisted to release memory.
    
    * LDA suffers from a stack overflow problem similar to that of MLlib ALS 
([SPARK-1006](https://spark-project.atlassian.net/browse/SPARK-1006)). To 
work around this issue, we checkpoint every few iterations.
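    
    The accumulator-based update described in the first bullet can be 
pictured with a small plain-Python sketch. This is only an illustration under 
assumptions, not the Spark accumulator API or the code in this PR: the sparse 
`(topic, term)` delta maps stand in for what each partition would contribute, 
and `merge_deltas` and `apply_deltas` are hypothetical helper names.

```python
from functools import reduce


def merge_deltas(left, right):
    """Combine two sparse delta maps keyed by (topic, term)."""
    merged = dict(left)
    for key, change in right.items():
        merged[key] = merged.get(key, 0) + change
    return merged


def apply_deltas(topic_term, deltas):
    """Return a new topic-term matrix: old counts plus the aggregated deltas."""
    updated = [row[:] for row in topic_term]
    for (topic, term), change in deltas.items():
        updated[topic][term] += change
    return updated


# Topic-term counts from the previous iteration (2 topics, 3 terms).
old_model = [[2, 0, 1], [0, 3, 1]]

# Hypothetical per-partition changes: moving a token from one topic to
# another contributes -1 to the old cell and +1 to the new cell.
partition_deltas = [
    {(0, 0): -1, (1, 0): +1},
    {(1, 1): -1, (0, 1): +1, (0, 0): -1, (1, 0): +1},
]

total_delta = reduce(merge_deltas, partition_deltas, {})
new_model = apply_deltas(old_model, total_delta)
```

    Because merging is associative and commutative, the per-partition deltas 
can be combined in any order, which is the property an accumulator relies on.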

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yinxusen/spark lda

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/476.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #476
    
----
commit 1f8793af562163e251d593c0f5118dea9176d7b5
Author: Xusen Yin <[email protected]>
Date:   2014-04-22T01:45:52Z

    initial commit

commit e137287f5cd73290fa558c07ededbeb091eda215
Author: Xusen Yin <[email protected]>
Date:   2014-04-22T01:48:52Z

    fix import style

commit 7378cff8eae01a726a1ea53b21db5f6972d6f14e
Author: Xusen Yin <[email protected]>
Date:   2014-04-22T01:52:30Z

    ready for PR

commit 063ff0fa607e956923dda32ebfcb4629583867d5
Author: Cheng Lian <[email protected]>
Date:   2014-04-22T02:32:36Z

    Code cleanup

commit 45b157edfa4e6444809daca9b0b2d57e2b575e4b
Author: Xusen Yin <[email protected]>
Date:   2014-04-22T05:06:56Z

    fix minor error

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
