[
https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308603#comment-14308603
]
Pedro Rodriguez commented on SPARK-5556:
----------------------------------------
Posting here as a status update. I am working on a collapsed Gibbs sampling
version of LDA that uses FastLDA for sublinear scaling in the number of topics,
and I will open a pull request for it. Below are the design document (the same
one as in the original LDA JIRA issue) and the repository/branch I am
working on.
https://docs.google.com/document/d/13MfroPXEEGKgaQaZlHkg1wdJMtCN5d8aHJuVkiOrOK4/edit?usp=sharing
https://github.com/EntilZha/spark/tree/LDA-Refactor
Tasks:
* Rebase from the merged implementation, refactor appropriately
* Merge/implement the required inheritance/trait/abstract classes to support
  two implementations (EM and Gibbs) using only the entry points exposed in the
  EM version, plus an optional argument to select between EM/Gibbs.
* Do performance tests comparable to those run for EM LDA.
Some details on the inheritance/trait/abstract class design:
The general idea is to define the API that LDA implementations must satisfy
as a trait/abstract class. All implementation details would be encapsulated
within a state object satisfying that trait/abstract class. LDA would be
responsible for creating an EM or Gibbs state object based on a user-supplied
switch/flag. Linked below is a sample implementation based on an earlier
version of the merged EM code (it needs to be updated to reflect the changes
since then, but it should show the idea well enough):
https://github.com/EntilZha/spark/blob/LDA-Refactor/mllib/src/main/scala/org/apache/spark/mllib/topicmodeling/LDA.scala#L216-L242
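To make the state-object idea concrete, here is a minimal, self-contained
sketch of how the pieces could fit together. It is not the code in the linked
branch and not the MLlib API: every name in it (Algorithm, LearningState,
EMState, GibbsState, the LDA constructor parameters) is a hypothetical chosen
for illustration, and the iteration bodies are placeholders rather than real
EM or Gibbs updates.

```scala
// Illustrative sketch only: all names are hypothetical, not the MLlib API.

// User-facing switch selecting the implementation.
sealed trait Algorithm
case object EM extends Algorithm
case object Gibbs extends Algorithm

// Contract every LDA implementation must satisfy.
trait LearningState {
  def name: String
  def iteration: Int
  // Run one iteration and return the updated state.
  def next(): LearningState
}

// Placeholder EM state; a real one would hold the graph and topic counts.
final case class EMState(k: Int, iteration: Int = 0) extends LearningState {
  val name: String = "EM"
  def next(): LearningState = copy(iteration = iteration + 1)
}

// Placeholder Gibbs state; a real one would resample topic assignments.
final case class GibbsState(k: Int, iteration: Int = 0) extends LearningState {
  val name: String = "Gibbs"
  def next(): LearningState = copy(iteration = iteration + 1)
}

// LDA only knows the trait; the optional argument defaults to EM so that
// callers using the existing entry points would see no change.
class LDA(k: Int, maxIterations: Int, algorithm: Algorithm = EM) {
  private def initialState(): LearningState = algorithm match {
    case EM    => EMState(k)
    case Gibbs => GibbsState(k)
  }

  def run(): LearningState = {
    var state = initialState()
    while (state.iteration < maxIterations) {
      state = state.next()
    }
    state
  }
}
```

With this shape, `new LDA(k = 10, maxIterations = 5).run()` would use EM, and
passing `algorithm = Gibbs` would switch implementations without touching any
other entry point.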
Timeline: I have been busier than expected, but rebase/refactoring should be
done in the next few days, then I will open a PR to get feedback while running
performance tests.
> Latent Dirichlet Allocation (LDA) using Gibbs sampler
> ------------------------------------------------------
>
> Key: SPARK-5556
> URL: https://issues.apache.org/jira/browse/SPARK-5556
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Guoqiang Li
>