Jencir Lee created SPARK-18599:
----------------------------------
Summary: Add the Spectral LDA algorithm, which applies tensor
decomposition to LDA modelling. Spectral LDA is a new breed of
algorithm, purely linear, with theoretically guaranteed convergence.
Key: SPARK-18599
URL: https://issues.apache.org/jira/browse/SPARK-18599
Project: Spark
Issue Type: New Feature
Components: MLlib
Reporter: Jencir Lee

The Spectral LDA algorithm transforms the LDA problem into an orthogonal tensor
decomposition problem. [[Anandkumar 2012]] establishes a theoretical guarantee
for the convergence of orthogonal tensor decomposition.
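
To make the decomposition step concrete, here is a minimal pure-Python sketch of the tensor power method with deflation, the kind of procedure analysed in [[Anandkumar 2012]]. It is an illustration under simplifying assumptions (a small symmetric 3rd-order tensor that is exactly orthogonally decomposable with positive weights, no noise, no random restarts) and is not the code from the PR. All function names are hypothetical.

```python
import math
import random


def contract(T, v):
    """Return T(I, v, v): contract a symmetric 3rd-order tensor with v twice."""
    k = len(v)
    return [
        sum(T[i][j][l] * v[j] * v[l] for j in range(k) for l in range(k))
        for i in range(k)
    ]


def normalize(v):
    """Scale v to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def rank_one_sum(weights, vectors):
    """Build sum_r w_r * (u_r outer u_r outer u_r) as a nested list."""
    k = len(vectors[0])
    return [[[sum(w * u[i] * u[j] * u[l] for w, u in zip(weights, vectors))
              for l in range(k)] for j in range(k)] for i in range(k)]


def power_iteration(T, n_iter=100, seed=0):
    """Find one (eigenvalue, eigenvector) pair by repeated contraction."""
    rng = random.Random(seed)
    v = normalize([rng.gauss(0.0, 1.0) for _ in range(len(T))])
    for _ in range(n_iter):
        v = normalize(contract(T, v))
    lam = sum(a * b for a, b in zip(contract(T, v), v))  # T(v, v, v)
    return lam, v


def decompose(T, rank):
    """Recover `rank` components one at a time, deflating after each."""
    k = len(T)
    pairs = []
    for r in range(rank):
        lam, v = power_iteration(T, seed=r)
        pairs.append((lam, v))
        found = rank_one_sum([lam], [v])
        # Subtract the recovered rank-one term before the next round.
        T = [[[T[i][j][l] - found[i][j][l] for l in range(k)]
              for j in range(k)] for i in range(k)]
    return pairs
```

Calling `decompose(rank_one_sum(weights, basis), len(basis))` on an orthonormal `basis` should return the weights and basis vectors in some order. On real data the tensor is only approximately decomposable, so a practical implementation would add random restarts and more careful stopping criteria.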

The algorithm first builds 2nd-order and 3rd-order moments from the empirical
word counts, orthogonalizes them, and finally performs the tensor decomposition
on the empirical data moments. The whole procedure is purely linear and can
leverage machine-native BLAS/LAPACK libraries (Spark needs to be compiled
with the `-Pnetlib-lgpl` option).
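
The moment-building step can be sketched as follows. This is a hedged illustration rather than the PR's implementation: it uses the first- and second-order moment definitions for LDA given in [[Anandkumar 2012]], where M2 = E[x1 (x) x2] - alpha0 / (alpha0 + 1) * M1 (x) M1, and it assumes the Dirichlet concentration sum `alpha0` is known. The function name `empirical_moments` and its parameters are hypothetical, and the 3rd-order moment and the orthogonalization are omitted for brevity.

```python
def empirical_moments(docs, alpha0):
    """Estimate M1 and the LDA-adjusted M2 from word-count vectors.

    docs:   list of per-document word-count vectors over the same vocabulary
    alpha0: sum of the Dirichlet concentration parameters (assumed known)
    """
    # Pair statistics need at least two words per document.
    docs = [c for c in docs if sum(c) >= 2]
    n = len(docs)
    vocab = len(docs[0])

    m1 = [0.0] * vocab
    pairs = [[0.0] * vocab for _ in range(vocab)]
    for c in docs:
        length = sum(c)
        for i in range(vocab):
            m1[i] += c[i] / length / n
            for j in range(vocab):
                # Count ordered pairs of distinct word positions (i, j).
                cooc = c[i] * c[j] - (c[i] if i == j else 0)
                pairs[i][j] += cooc / (length * (length - 1)) / n

    # Subtract the Dirichlet correction term from the raw pair moment.
    shrink = alpha0 / (alpha0 + 1.0)
    m2 = [[pairs[i][j] - shrink * m1[i] * m1[j] for j in range(vocab)]
          for i in range(vocab)]
    return m1, m2
```

The dense nested loops are only for readability. At realistic vocabulary sizes the same sums would be expressed as sparse or distributed matrix products, which is where native BLAS/LAPACK becomes relevant.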

It achieves log-perplexity competitive with Online Variational Inference in the
shortest time. Its memory usage is also clean: as of v2.0.0 we have experienced
crashes due to memory problems with the built-in Gibbs Sampler and Online
Variational Inference, but never with the Spectral LDA algorithm. The
algorithm scales linearly.

The original repo is at https://github.com/FurongHuang/SpectralLDA-TensorSpark.
We refactored the code to follow the Spark coding style and interfaces when
porting it over for the PR. We wrote a report describing the algorithm in detail
and listing test results, available at https://www.overleaf.com/read/wscdvwrjmtmw.
The report will be added to our official repo soon.
REFERENCES
Anandkumar, Anima, et al., Tensor decompositions for learning latent variable
models, 2012, https://arxiv.org/abs/1210.7559.