Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/4047#issuecomment-72348800
*Update on tests*
Summary:
* On a small dataset (20 newsgroups), it seems to work fine (on my laptop).
* On a big dataset (Wikipedia dump with close to 1 billion tokens), it's
been hard to get it to run for more than 10 or 20 iterations (on a 16-node EC2
cluster).
Details:
Small dataset: You can see the output here:
https://github.com/jkbradley/spark/blob/lda-tmp/20news.lda.out. The log
likelihood improves with each iteration, and iteration running times stay about
the same throughout training. The topics are divided really nicely among the
newsgroups (though note that I ran this with 20 topics). I used 100 iterations
and the stopwords mentioned above.
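For reference, here is a minimal sketch of how a run like this might be set up.
It assumes the LDA API exposes setters named `setK`/`setMaxIterations` and a
`run(corpus)` method, and that the corpus (an RDD of (doc id, term-count vector)
with stopwords already filtered out) is built elsewhere; the preprocessing is
not shown.

```scala
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Sketch: fit LDA on a prepared corpus of (document id, term-count vector) pairs.
// The parameter values mirror the small-dataset run described above.
def runSmallLDA(corpus: RDD[(Long, Vector)]) = {
  new LDA()
    .setK(20)               // 20 topics, matching the 20 newsgroups
    .setMaxIterations(100)  // 100 EM iterations, as in the run above
    .run(corpus)
}
```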
Large dataset: Even with checkpointing, it has been hard to run for many
iterations, mainly because shuffle files and checkpoint files build up. I need
to spend some more time running tests. Currently, the results on the Wikipedia
dump do not look good: the topics are pretty much all the same. It is unclear
whether this is due to poor convergence, a need for parameter tuning, a need to
support sparsity as mentioned above (which might help force topics to
differentiate), or a need for better initialization (since EM can have lots of
trouble with LDA's many local minima).
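For the large runs, the checkpointing setup might look roughly like the sketch
below. It assumes a `setCheckpointInterval` setter on LDA and uses a hypothetical
HDFS path for `setCheckpointDir`; the K value is purely illustrative and not
taken from the Wikipedia run above.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Sketch: configure checkpointing for a long run so GraphX lineage can be
// truncated periodically instead of letting shuffle data accumulate.
def runLargeLDA(sc: SparkContext, corpus: RDD[(Long, Vector)]) = {
  sc.setCheckpointDir("hdfs:///tmp/lda-checkpoints")  // hypothetical path
  new LDA()
    .setK(100)                  // illustrative value only
    .setMaxIterations(200)
    .setCheckpointInterval(10)  // checkpoint every 10 iterations (assumed setter)
    .run(corpus)
}
```

Even with a setup like this, the comment above notes that shuffle and checkpoint
files still accumulate, so long runs may need the checkpoint directory cleaned
up between jobs.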