Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/4047#issuecomment-72348800
*Update on tests*
Summary:
* On a small dataset (20 newsgroups), it seems to work fine (on my laptop).
* On a big dataset (Wikipedia dump with close to 1 billion tokens), it's
been hard to get it to run for more than 10 or 20 iterations (on a 16-node EC2
cluster).
Details:
Small dataset: You can see the output here:
https://github.com/jkbradley/spark/blob/lda-tmp/20news.lda.out. The log
likelihood improves with each iteration, and iteration running times stay about
the same throughout training. The topics are divided really nicely among the
newsgroups (though note that I ran this with 20 topics). I used 100 iterations
and the stopwords mentioned above.
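For reference, here is a minimal sketch of how a run like this might be set up.
It assumes the LDA API exposes setters named `setK`/`setMaxIterations` and a
`run(corpus)` method, and that the corpus (an RDD of (doc id, term-count vector)
with stopwords already filtered out) is built elsewhere; the preprocessing is
not shown.

```scala
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Sketch: fit LDA on a prepared corpus of (document id, term-count vector) pairs.
// The parameter values mirror the small-dataset run described above.
def runSmallLDA(corpus: RDD[(Long, Vector)]) = {
  new LDA()
    .setK(20)               // 20 topics, matching the 20 newsgroups
    .setMaxIterations(100)  // 100 EM iterations, as in the run above
    .run(corpus)
}
```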
Large dataset: Even with checkpointing, it has been hard to run for many
iterations, mainly because shuffle files and checkpoint files build up. I need
to spend some more time running tests. Currently, the results on the Wikipedia
dump do not look good: the topics are pretty much all the same. It is unclear
whether this is due to poor convergence, a need for parameter tuning, a need to
support sparsity as mentioned above (which might help force topics to
differentiate), or a need for better initialization (since EM can have lots of
trouble with LDA's many local minima).
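For the large runs, the checkpointing setup might look roughly like the sketch
below. It assumes a `setCheckpointInterval` setter on LDA and uses a hypothetical
HDFS path for `setCheckpointDir`; the K value is purely illustrative and not
taken from the Wikipedia run above.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Sketch: configure checkpointing for a long run so GraphX lineage can be
// truncated periodically instead of letting shuffle data accumulate.
def runLargeLDA(sc: SparkContext, corpus: RDD[(Long, Vector)]) = {
  sc.setCheckpointDir("hdfs:///tmp/lda-checkpoints")  // hypothetical path
  new LDA()
    .setK(100)                  // illustrative value only
    .setMaxIterations(200)
    .setCheckpointInterval(10)  // checkpoint every 10 iterations (assumed setter)
    .run(corpus)
}
```

Even with a setup like this, the comment above notes that shuffle and checkpoint
files still accumulate, so long runs may need the checkpoint directory cleaned
up between jobs.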