[
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218907#comment-14218907
]
Pedro Rodriguez commented on SPARK-1405:
----------------------------------------
I am not super familiar with LSA, so hopefully this seems reasonable.
On the LDA side, we are running performance tests and tuning accordingly using
partitions of Wikipedia. Beyond that, I started working on a data generator
based on the LDA generative model. This should be helpful for producing test
data with arbitrary parameters (for example, PR2388 appears to use a data set
with many documents containing few words each, while the wiki set is the
opposite situation).
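For reference, such a generator can be sketched directly from the LDA generative model: draw per-topic word distributions, then per document a topic mixture, then each word via a topic draw. This is only an illustrative sketch (all names such as `numTopics`, `vocabSize` are made up here, and the Dirichlet is fixed at alpha = 1 for simplicity), not the actual generator in progress:

```scala
import scala.util.Random

object LdaDataGen {
  val rng = new Random(7)

  // Dirichlet(1,...,1) sample via normalized Exponential(1) draws
  // (Gamma(1) == Exponential(1)); fixing alpha = 1 keeps the sketch simple.
  def dirichlet1(dim: Int): Array[Double] = {
    val g = Array.fill(dim)(-math.log(1.0 - rng.nextDouble()))
    val s = g.sum
    g.map(_ / s)
  }

  // Draw one index from the discrete distribution `p`.
  def sample(p: Array[Double]): Int = {
    var u = rng.nextDouble()
    var i = 0
    while (i < p.length - 1 && u >= p(i)) { u -= p(i); i += 1 }
    i
  }

  // Generate `numDocs` documents of `docLen` word ids each:
  //  1. draw per-topic word distributions phi_k over the vocabulary
  //  2. per document, draw a topic mixture theta, then for each word
  //     draw a topic z ~ theta and a word w ~ phi_z
  def generate(numDocs: Int, docLen: Int,
               numTopics: Int, vocabSize: Int): Seq[Array[Int]] = {
    val phi = Array.fill(numTopics)(dirichlet1(vocabSize))
    Seq.fill(numDocs) {
      val theta = dirichlet1(numTopics)
      Array.fill(docLen)(sample(phi(sample(theta))))
    }
  }
}
```

Varying the document count, document length, topic count, and vocabulary size covers both the many-short-documents regime and the wiki-like few-long-documents regime.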
So far our quality measure has been to run against GraphLab's LDA. Specifically,
we ran against the NIPS data set to make sure our convergence for the negative
log likelihood was reasonable. If that's helpful, I can give you that data set
in a format that is easier to parse than what is available online.
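The convergence check itself is just the per-word negative log likelihood of the corpus under the fitted parameters. A minimal sketch, assuming document-topic weights `theta(d)(k)` and topic-word probabilities `phi(k)(w)` (illustrative names, not from the actual PR):

```scala
object LdaNll {
  // Per-word negative log likelihood of a corpus under LDA parameters:
  //   p(w | d) = sum_k theta(d)(k) * phi(k)(w)
  // Lower values indicate a better fit; plotting this per iteration
  // against GraphLab's curve is the comparison described above.
  def negLogLikelihood(docs: Seq[Array[Int]],
                       theta: Array[Array[Double]],
                       phi: Array[Array[Double]]): Double = {
    var nll = 0.0
    var n = 0
    for ((doc, d) <- docs.zipWithIndex; w <- doc) {
      val pw = phi.indices.map(k => theta(d)(k) * phi(k)(w)).sum
      nll -= math.log(pw)
      n += 1
    }
    nll / n // average over all tokens
  }
}
```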
> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -----------------------------------------------------------------
>
> Key: SPARK-1405
> URL: https://issues.apache.org/jira/browse/SPARK-1405
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Xusen Yin
> Assignee: Guoqiang Li
> Priority: Critical
> Labels: features
> Attachments: performance_comparison.png
>
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model that extracts
> topics from a text corpus. Unlike the current machine learning algorithms
> in MLlib, which rely on optimization methods such as gradient descent,
> LDA uses inference methods such as Gibbs sampling.
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a
> wholeTextFiles API (solved yet), a word segmentation step (imported from
> Lucene), and a Gibbs sampling core.
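For context, the Gibbs sampling core described above typically resamples each token's topic from the standard collapsed conditional p(z_i = k | rest) proportional to (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta). A minimal single-sweep sketch under that update (all names here are illustrative; this is not the PR's implementation):

```scala
import scala.util.Random

object CollapsedGibbs {
  // One collapsed Gibbs sweep over all tokens. Count arrays:
  //   nDK(d)(k) = tokens in doc d assigned topic k
  //   nKW(k)(w) = tokens of word w assigned topic k
  //   nK(k)     = total tokens assigned topic k
  // z(d)(i) holds the current topic of token i in doc d; all counts
  // are updated in place so they stay consistent with z.
  def sweep(docs: Array[Array[Int]], z: Array[Array[Int]],
            nDK: Array[Array[Int]], nKW: Array[Array[Int]], nK: Array[Int],
            alpha: Double, beta: Double, vocabSize: Int, rng: Random): Unit = {
    val numTopics = nK.length
    for (d <- docs.indices; i <- docs(d).indices) {
      val w = docs(d)(i)
      val old = z(d)(i)
      // remove the current assignment from the counts
      nDK(d)(old) -= 1; nKW(old)(w) -= 1; nK(old) -= 1
      // unnormalized conditional for each topic
      val p = Array.tabulate(numTopics)(k =>
        (nDK(d)(k) + alpha) * (nKW(k)(w) + beta) / (nK(k) + vocabSize * beta))
      // sample a new topic and restore the counts
      var u = rng.nextDouble() * p.sum
      var k = 0
      while (k < numTopics - 1 && u >= p(k)) { u -= p(k); k += 1 }
      z(d)(i) = k
      nDK(d)(k) += 1; nKW(k)(w) += 1; nK(k) += 1
    }
  }
}
```

Each sweep leaves the count invariants intact (every token is counted exactly once), which is the property a distributed implementation has to preserve across partitions.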
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)