[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342311#comment-14342311 ]
Debasish Das edited comment on SPARK-5564 at 3/1/15 4:19 PM:
-------------------------------------------------------------

I am currently using the following PR to do large-rank matrix factorization with various constraints. I am not sure the current ALS will scale to large ranks, but I am keen to compare the exact formulation against the GraphX-based LDA flow: https://github.com/scalanlp/breeze/pull/364

The idea is to solve the constrained factorization problem as explained in Vorontsov and Potapenko:

minimize f(w, h*) s.t. 1'w = 1, w >= 0 (row constraints)
minimize f(w*, h) s.t. 0 <= h <= 1, then normalize each column of h

Here I want f(w, h) to be the MAP loss, but I already solved the least-squares variant in https://issues.apache.org/jira/browse/SPARK-2426 and got a good improvement in MAP statistics. I expect perplexity to improve here as well.

If no one else is looking into it, I would like to compare the join-based factorization flow (ml.recommendation.ALS) with the GraphX-based LDA flow. In fact, if you think the LDA-based flow will be more efficient than the join-based factorization flow for large ranks, I can implement stochastic matrix factorization directly on top of LDA and add both the least-squares and MAP losses.

> Support sparse LDA solutions
> ----------------------------
>
> Key: SPARK-5564
> URL: https://issues.apache.org/jira/browse/SPARK-5564
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 1.3.0
> Reporter: Joseph K. Bradley
>
> Latent Dirichlet Allocation (LDA) currently requires that the priors'
> concentration parameters be > 1.0. It should support values > 0.0, which
> should encourage sparser topics (phi) and document-topic distributions (theta).
> For EM, this will require adding a projection to the M-step, as in: Vorontsov
> and Potapenko. "Tutorial on Probabilistic Topic Modeling: Additive
> Regularization for Stochastic Matrix Factorization." 2014.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
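To make the two constrained subproblems in the comment concrete, here is a minimal sketch of alternating projected-gradient updates for the least-squares variant of f(w, h): each row of W is projected onto the probability simplex (1'w = 1, w >= 0) using the sort-based Euclidean projection, and H is clipped to [0, 1] and then column-normalized. This is illustrative Python/NumPy, not Spark or breeze code; the function names and the 1/L step-size choice are assumptions, not part of any existing API.

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection of v onto {w : w >= 0, 1'w = 1}
    # (standard sort-based algorithm) -- illustrative helper
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, v.size + 1) > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def constrained_factorize(X, rank, iters=100, seed=0):
    # Alternating projected-gradient sketch for f(W, H) = 0.5*||X - W H||_F^2
    # with row-simplex constraints on W and box + column-normalization on H.
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = np.apply_along_axis(project_simplex, 1, rng.uniform(size=(m, rank)))
    H = rng.uniform(size=(rank, n))
    H /= H.sum(axis=0, keepdims=True)
    for _ in range(iters):
        # W-step: gradient w.r.t. W, step 1/L, then project each row onto the simplex
        lr = 1.0 / (np.linalg.norm(H @ H.T, 2) + 1e-9)
        W = np.apply_along_axis(project_simplex, 1, W - lr * (W @ H - X) @ H.T)
        # H-step: gradient w.r.t. H, clip to [0, 1], then normalize each column
        lr = 1.0 / (np.linalg.norm(W.T @ W, 2) + 1e-9)
        H = np.clip(H - lr * W.T @ (W @ H - X), 0.0, 1.0)
        H /= np.maximum(H.sum(axis=0, keepdims=True), 1e-12)
    return W, H
```

For the MAP loss discussed above, only the gradient expressions would change; the projection and normalization steps stay the same, which is what makes the projected M-step in the issue description applicable to both losses.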