[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14271717#comment-14271717 ]
Pedro Rodriguez commented on SPARK-1405:
----------------------------------------

I second that: nice design doc and proposal. I agree that defining an API for implementations to satisfy would work well and allow for different algorithms. I am meeting with Evan later today and will probably talk about this and about where we are on the LDA implementation I have been working on.

I am currently collecting performance benchmarks on scaling with respect to the number of topics, which should be super-linear using the algorithm in this paper:
http://www.ics.uci.edu/~newman/pubs/fastlda.pdf

Overall, I think the scaling is good, but it would be nice to reduce the constant factor of the work if possible. Beyond that, perhaps it's time to start running larger-scale performance tests on EC2 clusters. If all goes well, I am hoping to refactor to satisfy the API proposal and open a PR.

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -----------------------------------------------------------------
>
>                 Key: SPARK-1405
>                 URL: https://issues.apache.org/jira/browse/SPARK-1405
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Xusen Yin
>            Assignee: Guoqiang Li
>            Priority: Critical
>              Labels: features
>         Attachments: performance_comparison.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts
> topics from a text corpus. Unlike the current machine learning algorithms
> in MLlib, which use optimization algorithms such as gradient descent,
> LDA uses expectation algorithms such as Gibbs sampling.
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a
> wholeTextFiles API (solved yet), a word segmentation (imported from Lucene),
> and a Gibbs sampling core.
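
For readers unfamiliar with the "Gibbs sampling core" mentioned in the issue description, here is a minimal single-machine sketch of collapsed Gibbs sampling for LDA in plain Python. It is not the code from this issue or from any Spark PR. The function name `gibbs_lda`, its parameters, the hyperparameter defaults and the toy corpus are all assumptions made for illustration. Documents are assumed to be already tokenized into integer word ids. Note that the inner loop evaluates every topic for every token, so the per-token cost grows linearly with the number of topics in this naive version.

```python
import random


def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Naive collapsed Gibbs sampler for LDA (illustrative sketch only).

    docs is a list of documents, each a list of integer word ids in
    [0, vocab_size). alpha and beta are symmetric Dirichlet priors.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                              # total tokens in topic k
    assign = []                                                 # topic of each token

    # Give every token a random initial topic and fill in the counts.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # Unnormalized conditional probability of each topic for this token.
                # This list has one entry per topic, hence O(num_topics) per token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]

                # Draw a new topic and add the token back into the counts.
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return doc_topic, topic_word


# Toy corpus (made up): four short documents over a six-word vocabulary.
docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 5], [0, 2, 1, 0], [5, 3, 4, 4]]
doc_topic, topic_word = gibbs_lda(docs, num_topics=2, vocab_size=6)
```

The returned `doc_topic` and `topic_word` count matrices can be normalized (after adding `alpha` and `beta` respectively) to estimate per-document topic mixtures and per-topic word distributions. A distributed version would additionally have to decide how these counts are partitioned and synchronized across workers, which this sketch does not attempt.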