Jencir Lee created SPARK-18599:
----------------------------------

             Summary: Add the Spectral LDA algorithm, which applies tensor 
decomposition to LDA modelling. Spectral LDA is a new breed of algorithm 
that is purely linear, with theoretically guaranteed convergence.
                 Key: SPARK-18599
                 URL: https://issues.apache.org/jira/browse/SPARK-18599
             Project: Spark
          Issue Type: New Feature
          Components: MLlib
            Reporter: Jencir Lee


The Spectral LDA algorithm transforms the LDA problem into an orthogonal tensor 
decomposition problem. [Anandkumar 2012] establishes a theoretical guarantee 
for the convergence of orthogonal tensor decomposition. 
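
To make "orthogonal tensor decomposition" concrete, here is a minimal sketch of 
the tensor power method with deflation, one standard routine analyzed in 
[Anandkumar 2012]. This issue does not say which decomposition routine the PR 
uses, so treat this as an illustration only. All function names are hypothetical, 
and the input is assumed to be a small symmetric 3rd-order tensor that is exactly 
a sum of rank-one terms over orthonormal vectors.

```python
import math
import random


def tensor_apply(T, v):
    """Return the vector T(I, v, v): contract a 3rd-order tensor with v twice."""
    k = len(v)
    return [sum(T[i][j][l] * v[j] * v[l] for j in range(k) for l in range(k))
            for i in range(k)]


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def power_iteration(T, n_iter=50, seed=0):
    """Find one (eigenvalue, eigenvector) pair from a random unit start vector."""
    rng = random.Random(seed)
    v = normalize([rng.gauss(0.0, 1.0) for _ in range(len(T))])
    for _ in range(n_iter):
        v = normalize(tensor_apply(T, v))
    lam = sum(a * b for a, b in zip(tensor_apply(T, v), v))  # T(v, v, v)
    return lam, v


def deflate(T, lam, v):
    """Subtract the recovered rank-one term lam * v (x) v (x) v from T."""
    k = len(v)
    return [[[T[i][j][l] - lam * v[i] * v[j] * v[l] for l in range(k)]
             for j in range(k)] for i in range(k)]


def decompose(T, n_components, n_iter=50, seed=0):
    """Recover n_components pairs by repeated power iteration and deflation."""
    pairs = []
    for c in range(n_components):
        lam, v = power_iteration(T, n_iter, seed + c)
        pairs.append((lam, v))
        T = deflate(T, lam, v)
    return pairs


def build_tensor(lams, vecs):
    """Build sum_i lams[i] * vecs[i] (x) vecs[i] (x) vecs[i] (toy test input)."""
    k = len(vecs[0])
    return [[[sum(lam * u[i] * u[j] * u[l] for lam, u in zip(lams, vecs))
              for l in range(k)] for j in range(k)] for i in range(k)]


# Toy input: three orthonormal directions with weights 3, 2, 1.
s = 1.0 / math.sqrt(2.0)
true_vecs = [[s, s, 0.0], [s, -s, 0.0], [0.0, 0.0, 1.0]]
toy_tensor = build_tensor([3.0, 2.0, 1.0], true_vecs)
for lam, v in decompose(toy_tensor, 3):
    print(round(lam, 6), [round(x, 6) for x in v])
```

In the LDA setting the recovered pairs would still have to be mapped back to 
topic-word distributions; that step is omitted from this sketch.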

The algorithm first builds the 2nd-order and 3rd-order moments from the 
empirical word counts, orthogonalizes them, and finally performs the tensor 
decomposition on the empirical data moments. The whole procedure is purely 
linear and can leverage machine-native BLAS/LAPACK libraries (Spark needs to be 
compiled with the `-Pnetlib-lgpl` option).
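
As a rough illustration of the first step, the sketch below estimates the 1st- 
and 2nd-order moments from per-document word counts. It assumes the LDA moment 
definition from [Anandkumar 2012], M2 = E[x1 (x) x2] - alpha0/(alpha0+1) * M1 (x) M1, 
where alpha0 is the sum of the Dirichlet prior parameters. The function name and 
the dense list-of-lists layout are hypothetical and not taken from the PR. The 
3rd-order moment and the orthogonalization step are left out.

```python
def empirical_moments(docs, alpha0):
    """Estimate M1 and the LDA-adjusted M2 from word-count vectors.

    docs   -- list of per-document count vectors, all of vocabulary size V
    alpha0 -- assumed sum of the Dirichlet prior parameters
    Documents with fewer than two tokens are skipped, because they contain
    no pair of distinct word positions.
    """
    vocab = len(docs[0])
    m1 = [0.0] * vocab
    pair = [[0.0] * vocab for _ in range(vocab)]
    used = 0
    for counts in docs:
        length = sum(counts)
        if length < 2:
            continue
        used += 1
        for i in range(vocab):
            m1[i] += counts[i] / length
            for j in range(vocab):
                # Count ordered pairs of distinct token positions (i-word, j-word).
                cooc = counts[i] * counts[j] - (counts[i] if i == j else 0)
                pair[i][j] += cooc / (length * (length - 1))
    m1 = [x / used for x in m1]
    shrink = alpha0 / (alpha0 + 1.0)
    m2 = [[pair[i][j] / used - shrink * m1[i] * m1[j] for j in range(vocab)]
          for i in range(vocab)]
    return m1, m2


# Two tiny documents over a two-word vocabulary.
m1, m2 = empirical_moments([[2, 0], [0, 2]], alpha0=1.0)
print(m1)
print(m2)
```

A real implementation would avoid materializing the dense V-by-V matrix for a 
large vocabulary; the dense form here only keeps the arithmetic easy to follow.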

It achieves log-perplexity competitive with Online Variational Inference in the 
shortest time. Its memory usage is also well behaved: as of v2.0.0 we have 
experienced crashes due to memory problems with the built-in Gibbs Sampler and 
Online Variational Inference, but never with the Spectral LDA algorithm. The 
algorithm scales linearly. 

The original repo is at https://github.com/FurongHuang/SpectralLDA-TensorSpark. 
We refactored the code to follow the Spark coding style and interfaces when 
porting it over for the PR. We wrote a report describing the algorithm in detail 
and listing test results at https://www.overleaf.com/read/wscdvwrjmtmw. The 
report will be added to our official repo soon.

REFERENCES
Anandkumar, Anima, et al., Tensor decompositions for learning latent variable 
models, 2012, https://arxiv.org/abs/1210.7559.



