GitHub user hhbyyh opened a pull request:

    https://github.com/apache/spark/pull/4419

    [SPARK-5563][mllib] online lda initial checkin

    JIRA: https://issues.apache.org/jira/browse/SPARK-5563
    This PR contains an implementation of 
[Online LDA](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf), 
based on the research of Matt Hoffman and David M. Blei, which gives LDA users 
an efficient alternative. The main advantages of the algorithm are its 
compatibility with streaming input and its low time and memory consumption, 
both of which come from splitting the corpus.
    For more details, please refer to the JIRA.
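    
    For orientation, the core update in the linked paper blends the current 
topic-term parameters with an estimate computed from one mini-batch, using a 
decaying step size rho_t = (tau0 + t)^(-kappa). Below is a minimal sketch in 
plain Python. The function names, the default `tau0` and `kappa` values, and 
the toy numbers are assumptions for illustration and are not taken from this 
PR's code.

```python
def step_size(t, tau0=1.0, kappa=0.7):
    """Decaying weight for mini-batch t; the paper takes kappa in (0.5, 1]."""
    return (tau0 + t) ** (-kappa)


def blend(lam, lam_hat, rho):
    """Return (1 - rho) * lam + rho * lam_hat, element by element.

    lam is the current topic-term parameter matrix (one row per topic);
    lam_hat is the estimate obtained from a single mini-batch.
    """
    return [
        [(1.0 - rho) * old + rho * new for old, new in zip(row_old, row_new)]
        for row_old, row_new in zip(lam, lam_hat)
    ]


# Toy values (hypothetical): 2 topics, 2 terms.
lam = [[1.0, 1.0], [1.0, 1.0]]
lam_hat = [[3.0, 1.0], [1.0, 3.0]]

rho = step_size(t=0)            # equals 1.0 when tau0 = 1.0
lam = blend(lam, lam_hat, rho)  # the first mini-batch replaces the initial values
```

    Later mini-batches get smaller weights, so only one mini-batch has to be 
processed at a time.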
    
    For reviewers:
    
    1. I made a minor change to the return type of `LDA.run` (from 
`DistributedLDAModel` to `LDAModel`), since `DistributedLDAModel` is 
graph-based.
    
    2. The current interface of `LDA.run` is not efficient for the online 
algorithm. Online LDA can perform the doc2vec step within each mini-batch and 
does not need to hold the whole corpus in memory.
    
    3. Currently I use `RDD.randomSplit` to split the corpus horizontally, 
which degrades performance (more than 10x slower). Is there a better way to 
do this?
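    
    As a plain-Python illustration of the horizontal split in item 3 (this is 
not Spark code, and `split_into_batches` is a hypothetical helper), a single 
pass over the corpus can assign each document to one of several mini-batches 
by drawing a random batch index. Whether an equivalent single-pass approach 
would beat `RDD.randomSplit` in Spark is an open question that would need to 
be measured.

```python
import random


def split_into_batches(docs, num_batches, seed=0):
    """Assign every document to exactly one mini-batch in a single pass."""
    rng = random.Random(seed)  # seeded so the split is reproducible
    batches = [[] for _ in range(num_batches)]
    for doc in docs:
        batches[rng.randrange(num_batches)].append(doc)
    return batches


# Placeholder document ids standing in for the real corpus.
docs = ["doc%d" % i for i in range(600)]
batches = split_into_batches(docs, num_batches=10)
```

    Each document lands in exactly one batch, so the batches together cover 
the corpus without duplicates; batch sizes are only approximately equal.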
    
    Performance and result comparison with the current EM implementation:
    The test data set repeats the following 6 documents 100 times:
      apple banana
      apple orange
      orange banana
      tiger cat
      cat dog
      tiger dog
    600 documents and 1200 tokens in total, with a vocabulary size of 6; very 
easy to reproduce on your PC.
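    
    The corpus described above can be rebuilt in a few lines to check the 
counts. This is a plain-Python sketch for convenience, not the code used for 
the timings below, and the variable names are illustrative.

```python
base_docs = [
    "apple banana",
    "apple orange",
    "orange banana",
    "tiger cat",
    "cat dog",
    "tiger dog",
]

# Repeat the 6 documents 100 times and tokenize on whitespace.
corpus = [doc.split() for _ in range(100) for doc in base_docs]

num_docs = len(corpus)
num_tokens = sum(len(doc) for doc in corpus)
vocab = sorted({term for doc in corpus for term in doc})

print(num_docs, num_tokens, len(vocab))  # prints: 600 1200 6
```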
    
    EM implementation: lda.run(corpus), 30 iterations, automatic parameters
    
    >   Corpus summary:
    >    Training set size: 600 documents
    >    Vocabulary size: 6 terms
    >    Training set size: 1200 tokens
    >    Preprocessing time: 3.965440641 sec
    >    
    >    Finished training LDA model.  Summary:
    >    Training time: 395.830773969 sec
    >    2 topics:
    >    TOPIC 0
    >    banana 0.18063733648044106
    >    dog    0.17613878600129707
    >    apple  0.1696818853021358
    >    orange 0.1646544894546831
    >    tiger  0.15561684070970766
    >    cat    0.15327066205173534
    >
    >    TOPIC 1
    >    cat    0.18006269828910626
    >    tiger  0.17771651490103385
    >    orange 0.1686788479353741
    >    apple  0.16365144195225495
    >    dog    0.1571945282354216
    >    banana 0.15269596868680924
    
    
    Online LDA: run(corpus, lda.LDAMode.Online)
    
    >   Corpus summary:
    >    Training set size: 600 documents
    >    Vocabulary size: 6 terms
    >    Training set size: 1200 tokens
    >    Preprocessing time: 4.035652719 sec
    >
    >   Finished training LDA model.  Summary:
    >    Training time: 15.72914271 sec
    >
    >   2 topics:
    >   TOPIC 0
    >   apple   0.34047846308724955
    >   banana  0.3389019755911641
    >   orange  0.31774004408135487
    >   cat     9.773363079267432E-4
    >   dog     9.552721891145982E-4
    >   tiger   9.469087431901764E-4
    >
    >   TOPIC 1
    >   cat     0.3519694583370116
    >   tiger   0.3353643872639939
    >   dog     0.30993905428237273
    >   banana  9.528473706557286E-4
    >   apple   8.999766283570917E-4
    >   orange  8.742761176089969E-4
    
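    The online version is faster (roughly 25x on this run: about 15.7 sec 
versus 395.8 sec of training time) and gives better results: its two topics 
cleanly separate the fruit terms from the animal terms, while the EM topics 
stay close to uniform. This is due to the nature of the algorithm (thanks to 
Matt Hoffman and David M. Blei).
    The version at https://github.com/hhbyyh/OnlineLDA_Spark is even faster 
than the online run above, mainly because it avoids `randomSplit`. For the 
same input, it takes less than 3 seconds, including SparkContext 
initialization and stop.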
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hhbyyh/spark ldaonline

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4419.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4419
    
----
commit d640d9c58cd4f3caa6eac462b947b3a891dabbda
Author: Yuhao Yang <[email protected]>
Date:   2015-02-06T03:12:49Z

    online lda initial checkin

----

