[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269933#comment-14269933 ]

Joseph K. Bradley commented on SPARK-1405:
------------------------------------------

Hi all, there are several possible Spark LDA implementations out there (in PRs 
or public GitHub repos), and I believe the best thing to do is to:
* settle on a simple API + implementation to start with
* switch existing PRs which use alternative algorithms (EM, Gibbs sampling, 
variational EM, etc.) to the same interface, so that the inference algorithm 
can be selected via a parameter
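To make the second bullet concrete, here is a minimal sketch of what such a parameterized interface could look like. All names below (LDA, LDAAlgorithm, setAlgorithm, etc.) are hypothetical placeholders for discussion, not the final MLlib API:

```scala
// Hypothetical sketch only: illustrates selecting the inference algorithm
// via a parameter. None of these names are the final MLlib API.
sealed trait LDAAlgorithm { def name: String }
case object EMAlgorithm extends LDAAlgorithm { val name = "em" }
case object GibbsAlgorithm extends LDAAlgorithm { val name = "gibbs" }

class LDA {
  private var k: Int = 10                           // number of topics
  private var algorithm: LDAAlgorithm = EMAlgorithm // default inference method

  def setK(k: Int): this.type = { this.k = k; this }
  def setAlgorithm(a: LDAAlgorithm): this.type = { this.algorithm = a; this }
  def getK: Int = k
  def getAlgorithm: String = algorithm.name
}
```

The point is that callers pick the algorithm without the rest of the API changing, so each existing PR's implementation could plug in behind its own LDAAlgorithm case.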

Towards this goal, I've written [this design 
doc](https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo/edit?usp=sharing), 
which focuses on the API rather than on algorithm design.  I'm also preparing 
a PR based on the simplest implementation I have found, written by [~dlwh].  I 
should be able to submit it in a day or so.  It uses (non-variational) EM, 
which should be fast, though perhaps not as accurate as Gibbs sampling.

I'd of course appreciate feedback on the design doc, as well as on the actual 
PR.  It would be great to settle on a public API that can satisfy the many 
existing implementations of LDA in Spark.

When we merge the initial LDA PR, [~mengxr] will be sure to include all of 
those who have participated as authors of Spark LDA PRs: [~akopich], [~witgo], 
[~yinxusen], [~dlwh], Pedro, [~jegonzal]


> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -----------------------------------------------------------------
>
>                 Key: SPARK-1405
>                 URL: https://issues.apache.org/jira/browse/SPARK-1405
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Xusen Yin
>            Assignee: Guoqiang Li
>            Priority: Critical
>              Labels: features
>         Attachments: performance_comparison.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model that extracts 
> topics from a text corpus. Unlike the current machine learning algorithms in 
> MLlib, which rely on optimization methods such as gradient descent, LDA uses 
> inference algorithms such as Gibbs sampling. 
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
> wholeTextFiles API (already resolved), a word segmenter (imported from 
> Lucene), and a Gibbs sampling core.
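The "Gibbs sampling core" mentioned in the quoted description can be sketched as a toy single-machine collapsed Gibbs sampler. This is illustrative only and is not the PR's actual code; it just shows the count-based update that any Gibbs-based LDA repeats per token:

```scala
import scala.util.Random

// Toy collapsed Gibbs sampler for LDA (illustrative only, not the PR's code).
// docs: each document is an array of word ids in [0, vocabSize).
// Returns the per-token topic assignments z.
def gibbsLDA(docs: Array[Array[Int]], numTopics: Int, vocabSize: Int,
             alpha: Double, beta: Double, iterations: Int,
             rng: Random = new Random(0)): Array[Array[Int]] = {
  val docTopic = Array.ofDim[Int](docs.length, numTopics)   // n_{dk}
  val topicWord = Array.ofDim[Int](numTopics, vocabSize)    // n_{kw}
  val topicTotal = Array.ofDim[Int](numTopics)              // n_k
  // Random initial topic assignment for every token.
  val z = docs.map(_.map { _ => rng.nextInt(numTopics) })
  for (d <- docs.indices; i <- docs(d).indices) {
    val w = docs(d)(i); val k = z(d)(i)
    docTopic(d)(k) += 1; topicWord(k)(w) += 1; topicTotal(k) += 1
  }
  for (_ <- 0 until iterations; d <- docs.indices; i <- docs(d).indices) {
    val w = docs(d)(i); val old = z(d)(i)
    // Remove the token's current assignment from the counts.
    docTopic(d)(old) -= 1; topicWord(old)(w) -= 1; topicTotal(old) -= 1
    // Sample a new topic from the collapsed conditional:
    // p(k) proportional to (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta).
    val p = Array.tabulate(numTopics) { k =>
      (docTopic(d)(k) + alpha) * (topicWord(k)(w) + beta) /
        (topicTotal(k) + vocabSize * beta)
    }
    var u = rng.nextDouble() * p.sum
    var k = 0
    while (k < numTopics - 1 && u > p(k)) { u -= p(k); k += 1 }
    z(d)(i) = k
    docTopic(d)(k) += 1; topicWord(k)(w) += 1; topicTotal(k) += 1
  }
  z
}
```

Parallelizing this per-token update across partitions (and reconciling the shared topic-word counts) is exactly where the distributed designs in the various PRs differ.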



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
