[
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218941#comment-14218941
]
Debasish Das commented on SPARK-1405:
-------------------------------------
For LSA you can find references on the PR. Microsoft paper ran LSA on wiki
dataset but they compared map and ndcg measures...
NIPS and wiki datasets both will help....I was thinking about wiki....could you
please add the NIPS dataset as well and reference for it ?
I will look into graphlab lda and what measures they are running...
> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -----------------------------------------------------------------
>
> Key: SPARK-1405
> URL: https://issues.apache.org/jira/browse/SPARK-1405
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Xusen Yin
> Assignee: Guoqiang Li
> Priority: Critical
> Labels: features
> Attachments: performance_comparison.png
>
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts
> topics from text corpus. Different with current machine learning algorithms
> in MLlib, instead of using optimization algorithms such as gradient desent,
> LDA uses expectation algorithms such as Gibbs sampling.
> In this PR, I prepare a LDA implementation based on Gibbs sampling, with a
> wholeTextFiles API (solved yet), a word segmentation (import from Lucene),
> and a Gibbs sampling core.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]