[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153412#comment-14153412 ]
David Hall edited comment on SPARK-1405 at 9/30/14 5:10 PM:
------------------------------------------------------------

Hi everyone,

Sorry for taking so long to reply. As part of some contract work with Alpine, I've been working on yet another LDA implementation. We're actually implementing partially labeled LDA [1], which is a strict generalization of LDA. The implementation is based on EM MAP inference rather than Gibbs sampling; EM has been shown to converge more quickly than Gibbs LDA, both in number of iterations and in wall time, and to reach better optima [2]. It also remains well-founded when run in parallel: parallelized collapsed Gibbs sampling comes with no guarantees, whereas EM is still guaranteed to converge to a local optimum. I'll post the code as soon as I clear it with Alpine.

[1] http://nlp.stanford.edu/~dramage/papers/pldp-kdd11.pdf
[2] http://mimno.infosci.cornell.edu/info6150/readings/UAI_09.pdf

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -----------------------------------------------------------------
>
>                 Key: SPARK-1405
>                 URL: https://issues.apache.org/jira/browse/SPARK-1405
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Xusen Yin
>            Assignee: Guoqiang Li
>              Labels: features
>         Attachments: performance_comparison.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model that extracts topics from a text corpus. Unlike the current machine learning algorithms in MLlib, which rely on optimization methods such as gradient descent, LDA uses expectation methods such as Gibbs sampling.
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (already solved), a word segmenter (imported from Lucene), and a Gibbs sampling core.
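For concreteness, here is a minimal, self-contained sketch of what EM-based MAP inference looks like for plain (unlabeled) LDA, following the MAP updates analyzed in [2]: an E-step gamma(d,w,k) ∝ theta(d,k) * phi(w,k), followed by an M-step theta(d,k) ∝ N(d,k) + alpha - 1 and phi(w,k) ∝ N(w,k) + beta - 1, where N(d,k) and N(w,k) are expected topic counts. This is not the Alpine/PLDA code mentioned above; the object and method names, the dense-array layout, and the default hyperparameters are all illustrative assumptions.

{code:scala}
// Illustrative sketch of MAP EM for plain LDA (not the Alpine/PLDA code).
// docs(d) lists (wordId, count) pairs for document d.
object LdaMapEm {

  /** Returns (theta: D x K doc-topic mixtures, phi: W x K word-topic dists). */
  def fit(docs: Array[Array[(Int, Int)]], vocabSize: Int, numTopics: Int,
          alpha: Double = 1.1, beta: Double = 1.1, iters: Int = 50)
      : (Array[Array[Double]], Array[Array[Double]]) = {

    val rng = new scala.util.Random(0)
    // Random positive initialization, normalized to a distribution per row.
    def randRow(n: Int): Array[Double] = {
      val r = Array.fill(n)(rng.nextDouble() + 1e-2)
      val s = r.sum
      r.map(_ / s)
    }
    var theta = Array.fill(docs.length)(randRow(numTopics)) // D x K
    var phi   = Array.fill(vocabSize)(randRow(numTopics))   // W x K

    for (_ <- 0 until iters) {
      // Expected topic counts accumulated during the E-step.
      val nDk = Array.ofDim[Double](docs.length, numTopics)
      val nWk = Array.ofDim[Double](vocabSize, numTopics)

      for (d <- docs.indices; (w, cnt) <- docs(d)) {
        // E-step: posterior over each token's topic, gamma ∝ theta * phi.
        val gamma = Array.tabulate(numTopics)(k => theta(d)(k) * phi(w)(k))
        val z = gamma.sum
        var k = 0
        while (k < numTopics) {
          val g = cnt * gamma(k) / z
          nDk(d)(k) += g
          nWk(w)(k) += g
          k += 1
        }
      }

      // M-step: MAP point estimates; the "- 1" is the Dirichlet-mode
      // correction, so alpha and beta must be > 1 to stay nonnegative.
      theta = nDk.map { row =>
        val sm = row.map(_ + alpha - 1.0)
        val s = sm.sum
        sm.map(_ / s)
      }
      val colSums = Array.tabulate(numTopics) { k =>
        (0 until vocabSize).map(w => nWk(w)(k) + beta - 1.0).sum
      }
      phi = Array.tabulate(vocabSize, numTopics) { (w, k) =>
        (nWk(w)(k) + beta - 1.0) / colSums(k)
      }
    }
    (theta, phi)
  }
}
{code}

Because the E-step treats every token independently given theta and phi, the expected counts can be accumulated over partitions of documents (e.g., across an RDD) and summed, reproducing the sequential updates exactly; this is the parallel soundness the comment above alludes to, in contrast to parallelized collapsed Gibbs sampling.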