GitHub user hhbyyh opened a pull request:
https://github.com/apache/spark/pull/4419
[SPARK-5563][mllib] online lda initial checkin
JIRA: https://issues.apache.org/jira/browse/SPARK-5563
The PR contains an implementation of [Online LDA](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf), based on the research of Matt Hoffman and David M. Blei, which provides an efficient option for LDA users. The major advantages of the algorithm are streaming compatibility and economical time/memory consumption, since the corpus is processed in mini-batch splits.
For more details, please refer to the JIRA.
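For context, the heart of the paper's algorithm is a stochastic variational update: each mini-batch yields an estimate of the topic-word variational parameter lambda, which is blended into the running value with a decaying learning rate. Below is a minimal sketch of that update; the names (`lambda`, `tau0`, `kappa`, `eStep`) follow the paper, not this PR's code, and the per-batch variational E-step is left abstract:

```scala
// Sketch of the stochastic update at the heart of online LDA (Hoffman et al.).
// Not this PR's code: eStep stands in for the per-batch variational inference.
def onlineUpdate(
    batches: Iterator[Seq[(Long, Map[Int, Double])]],             // mini-batches of (docId, termCounts)
    eStep: Seq[(Long, Map[Int, Double])] => Array[Array[Double]], // batch => k x vocabSize estimate
    lambda: Array[Array[Double]],                                 // k x vocabSize, updated in place
    tau0: Double = 1024.0,
    kappa: Double = 0.51): Unit = {
  var t = 0
  for (batch <- batches) {
    // Learning rate rho_t = (tau0 + t)^(-kappa) decays over mini-batches,
    // so later batches perturb lambda less and the estimate converges.
    val rho = math.pow(tau0 + t, -kappa)
    val lambdaHat = eStep(batch) // sufficient statistics from this mini-batch only
    for (i <- lambda.indices; w <- lambda(i).indices) {
      lambda(i)(w) = (1.0 - rho) * lambda(i)(w) + rho * lambdaHat(i)(w)
    }
    t += 1
  }
}
```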
For reviewers:
1. I made a minor change to the return type of `LDA.run` (from `DistributedLDAModel` to `LDAModel`), since `DistributedLDAModel` is graph-based.
2. The current interface of `LDA.run` is not efficient for the online algorithm. Online LDA can perform the document-to-vector conversion within each mini-batch and does not need to hold the whole corpus in memory.
3. Currently I use `RDD.randomSplit` to do a horizontal split of the corpus, which degrades performance (more than 10x slower). Is there a more suitable way to do that? (See the sketch after this list.)
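For point 3, this is roughly the split in question (a hypothetical illustration, not the PR's code): `randomSplit` with equal weights yields mini-batches of roughly equal size.

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Split the corpus into numBatches mini-batches of roughly equal size.
// randomSplit normalizes the weights, so equal weights give equal shares.
def splitCorpus(corpus: RDD[(Long, Vector)], numBatches: Int): Array[RDD[(Long, Vector)]] =
  corpus.randomSplit(Array.fill(numBatches)(1.0 / numBatches))
```

Note that each returned RDD re-evaluates its parent, so iterating over many batches without caching re-reads the corpus once per batch, which may account for the slowdown.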
Performance and result comparison with the current EM implementation. The test data set repeats the following 6 documents 100 times:
apple banana
apple orange
orange banana
tiger cat
cat dog
tiger dog
That is 600 documents and 1200 tokens in total, with a vocabulary size of 6, so it is very easy to reproduce on your PC; one way to build the corpus is sketched below.
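A sketch of building that toy corpus in the spark-shell (not the PR's test harness; `sc` is the shell's SparkContext):

```scala
import org.apache.spark.mllib.linalg.Vectors

val vocab = Seq("apple", "banana", "orange", "tiger", "cat", "dog")
val docs = Seq("apple banana", "apple orange", "orange banana",
               "tiger cat", "cat dog", "tiger dog")

// Repeat the 6 documents 100 times and turn each into a (docId, termCounts) pair.
val corpus = sc.parallelize(Seq.fill(100)(docs).flatten.zipWithIndex.map {
  case (doc, id) =>
    val counts = doc.split(" ").groupBy(identity).map {
      case (word, occurrences) => (vocab.indexOf(word), occurrences.length.toDouble)
    }.toSeq
    (id.toLong, Vectors.sparse(vocab.size, counts))
})
```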
EM implementation: `lda.run(corpus)`, 30 iterations, automatic parameters
> Corpus summary:
> Training set size: 600 documents
> Vocabulary size: 6 terms
> Training set size: 1200 tokens
> Preprocessing time: 3.965440641 sec
>
> Finished training LDA model. Summary:
> Training time: 395.830773969 sec
> 2 topics:
> TOPIC 0
> banana 0.18063733648044106
> dog 0.17613878600129707
> apple 0.1696818853021358
> orange 0.1646544894546831
> tiger 0.15561684070970766
> cat 0.15327066205173534
>
> TOPIC 1
> cat 0.18006269828910626
> tiger 0.17771651490103385
> orange 0.1686788479353741
> apple 0.16365144195225495
> dog 0.1571945282354216
> banana 0.15269596868680924
Online LDA: `run(corpus, lda.LDAMode.Online)`
> Corpus summary:
> Training set size: 600 documents
> Vocabulary size: 6 terms
> Training set size: 1200 tokens
> Preprocessing time: 4.035652719 sec
>
> Finished training LDA model. Summary:
> Training time: 15.72914271 sec
>
> 2 topics:
> TOPIC 0
> apple 0.34047846308724955
> banana 0.3389019755911641
> orange 0.31774004408135487
> cat 9.773363079267432E-4
> dog 9.552721891145982E-4
> tiger 9.469087431901764E-4
>
> TOPIC 1
> cat 0.3519694583370116
> tiger 0.3353643872639939
> dog 0.30993905428237273
> banana 9.528473706557286E-4
> apple 8.999766283570917E-4
> orange 8.742761176089969E-4
The online version is faster and produces better results, thanks to the nature of the algorithm (credit to Matt Hoffman and David M. Blei).
The standalone version from https://github.com/hhbyyh/OnlineLDA_Spark is even faster than the implementation in this PR, mainly because it avoids the `randomSplit`. For the same input, it takes less than 3 seconds, including SparkContext initialization and stop.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/hhbyyh/spark ldaonline
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/4419.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #4419
----
commit d640d9c58cd4f3caa6eac462b947b3a891dabbda
Author: Yuhao Yang <[email protected]>
Date: 2015-02-06T03:12:49Z
online lda initial checkin
----