[jira] [Comment Edited] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

Pedro Rodriguez (JIRA) Mon, 24 Nov 2014 18:19:31 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223922#comment-14223922
 ]


Pedro Rodriguez edited comment on SPARK-1405 at 11/25/14 2:18 AM:
------------------------------------------------------------------

Finished an initial implementation of an LDA data generator. Initial testing 
suggests the output is reasonable, but that testing has been informal so far. 
I will be looking at metrics other than "it looks good" to verify that the 
data being generated is correct.

Implementation: 
https://github.com/EntilZha/spark/blob/LDA/mllib/src/main/scala/org/apache/spark/mllib/util/LDADataGenerator.scala
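
For readers unfamiliar with what an LDA data generator does, the following is a minimal single-machine sketch of the standard LDA generative process. It is not the linked Scala implementation. The function names (`sample_dirichlet`, `sample_index`, `generate_corpus`), the parameter names, and the default hyperparameters are all assumptions chosen for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Draw an index from the categorical distribution given by probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(num_docs, num_topics, vocab_size, doc_length,
                    alpha=0.5, beta=0.1, seed=42):
    """Generate synthetic documents as lists of integer word ids.

    alpha and beta are symmetric Dirichlet hyperparameters (assumed values).
    Returns the true topic-word distributions together with the documents,
    so a fitted model can later be compared against the ground truth.
    """
    rng = random.Random(seed)
    # One word distribution per topic.
    topics = [sample_dirichlet([beta] * vocab_size, rng)
              for _ in range(num_topics)]
    docs = []
    for _ in range(num_docs):
        # Per-document topic mixture.
        theta = sample_dirichlet([alpha] * num_topics, rng)
        words = []
        for _ in range(doc_length):
            z = sample_index(theta, rng)                # pick a topic
            words.append(sample_index(topics[z], rng))  # pick a word from it
        docs.append(words)
    return topics, docs


if __name__ == "__main__":
    topics, docs = generate_corpus(num_docs=3, num_topics=2,
                                   vocab_size=10, doc_length=8)
    for doc in docs:
        print(doc)
```

Because the generator returns the true topic distributions, one way to go beyond "it looks good" is to fit a model to the generated documents and compare the recovered topics against `topics`.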


was (Author: pedrorodriguez):
Finished an initial implementation of an LDA data generator. I have done some 
initial testing and it seems reasonable, but just initial testing at the 
moment. Will be looking at metrics other than "it looks good" to make sure that 
the data being generated looks reasonable.

Implementation: 
https://github.com/EntilZha/spark/blob/LDA/mllib/src/main/scala/org/apache/spark/mllib/util/LDADataGenerator.scala

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -----------------------------------------------------------------
>
>                 Key: SPARK-1405
>                 URL: https://issues.apache.org/jira/browse/SPARK-1405
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Xusen Yin
>            Assignee: Guoqiang Li
>            Priority: Critical
>              Labels: features
>         Attachments: performance_comparison.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model that extracts 
> topics from a text corpus. Unlike the current machine learning algorithms 
> in MLlib, which rely on optimization algorithms such as gradient descent, 
> LDA uses sampling-based inference algorithms such as Gibbs sampling. 
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
> wholeTextFiles API (already solved), word segmentation (imported from 
> Lucene), and a Gibbs sampling core.
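
As a rough illustration of what a Gibbs sampling core for LDA involves, here is a minimal single-machine sketch of collapsed Gibbs sampling. It is not the parallel implementation from the PR, which is not shown in this message. The function name `gibbs_lda`, its parameters, and the default hyperparameters are assumptions chosen for illustration.

```python
import random


def gibbs_lda(docs, num_topics, vocab_size, alpha=0.5, beta=0.1,
              iterations=50, seed=0):
    """Collapsed Gibbs sampling for LDA on documents given as word-id lists.

    Returns (doc_topic, topic_word) count matrices after the final sweep.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Random initial topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1

                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]

                # Resample the topic and restore the counts.
                z = rng.choices(range(num_topics), weights=weights, k=1)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    return doc_topic, topic_word


if __name__ == "__main__":
    toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3]]
    doc_topic, topic_word = gibbs_lda(toy_docs, num_topics=2, vocab_size=4)
    print(doc_topic)
```

Each token update reads and writes shared count tables, which is part of what makes a parallel version on Spark more involved than this sequential loop.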



