[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14271717#comment-14271717
 ] 

Pedro Rodriguez commented on SPARK-1405:
----------------------------------------

Seconding the praise for the design doc and proposal. I agree that defining an 
API for implementations to satisfy would work well and allow for different 
algorithms.

I am meeting with Evan later today and probably will talk about this and where 
we are on the LDA implementation I have been working on. I am currently getting 
some performance benchmarks related to scaling with respect to the number of 
topics, which should be sub-linear using the algorithm in this paper:
http://www.ics.uci.edu/~newman/pubs/fastlda.pdf
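
For context on why scaling in the number of topics matters, here is a minimal 
sketch of a plain collapsed Gibbs sweep for LDA. It is not the implementation 
being benchmarked and not the algorithm from the paper; every function and 
variable name is illustrative, and the symmetric priors alpha and beta are 
assumed. The point is only that the naive per-token draw loops over all K 
topics, which is the baseline cost the paper's algorithm aims to reduce.

```python
import random


def init_state(docs, num_topics, vocab_size, rng):
    """Assign each token a random topic and build the count tables."""
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    for d, doc in enumerate(docs):
        for w, k in zip(doc, assignments[d]):
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
    return assignments, doc_topic, topic_word, topic_total


def sample_topic(doc_row, topic_word, topic_total, w, alpha, beta, vocab_size, rng):
    """Draw a topic for one token of word w from the collapsed conditional.

    The list comprehension touches every topic, so each draw costs O(K).
    """
    num_topics = len(topic_total)
    weights = [
        (doc_row[k] + alpha) * (topic_word[k][w] + beta)
        / (topic_total[k] + vocab_size * beta)
        for k in range(num_topics)
    ]
    # Inverse-CDF draw over the unnormalized weights.
    u = rng.random() * sum(weights)
    acc = 0.0
    for k, weight in enumerate(weights):
        acc += weight
        if u < acc:
            return k
    return num_topics - 1  # guard against floating-point round-off


def gibbs_sweep(docs, assignments, doc_topic, topic_word, topic_total,
                alpha, beta, vocab_size, rng):
    """Resample the topic of every token once, updating counts in place."""
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            old = assignments[d][i]
            # Remove the token's current assignment from the counts.
            doc_topic[d][old] -= 1
            topic_word[old][w] -= 1
            topic_total[old] -= 1
            new = sample_topic(doc_topic[d], topic_word, topic_total,
                               w, alpha, beta, vocab_size, rng)
            # Record the new assignment.
            assignments[d][i] = new
            doc_topic[d][new] += 1
            topic_word[new][w] += 1
            topic_total[new] += 1
```

Documents here are lists of integer word ids. One sweep costs roughly 
(number of tokens) x K, so doubling the number of topics doubles the work in 
this naive version.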

Overall status, I think, is that the scaling is good, but it would be nice to 
improve the constant factor if possible. Beyond that, perhaps it's time to 
start running larger-scale performance tests on EC2 clusters. If all goes 
well, I am hoping to refactor to satisfy the API proposal and open a PR.

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -----------------------------------------------------------------
>
>                 Key: SPARK-1405
>                 URL: https://issues.apache.org/jira/browse/SPARK-1405
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Xusen Yin
>            Assignee: Guoqiang Li
>            Priority: Critical
>              Labels: features
>         Attachments: performance_comparison.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
> topics from a text corpus. Unlike the current machine learning algorithms 
> in MLlib, which use optimization algorithms such as gradient descent, 
> LDA uses expectation algorithms such as Gibbs sampling. 
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
> wholeTextFiles API (solved yet), word segmentation (imported from Lucene), 
> and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
