GitHub user feynmanliang commented on the pull request:
https://github.com/apache/spark/pull/7705#issuecomment-125679733
@hhbyyh I believe that log-perplexity is still used to evaluate generative NLP
models (e.g. Table 2 of [Wallach 09](http://dirichlet.net/pdf/wallach09rethinking.pdf)
evaluates with log P(W_test | W, Z, …) / N_test). Is your concern about
more efficient ways to compute it?
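
For reference, the quantity in that table is a per-token predictive log-likelihood, and perplexity is just its exponentiated negation; a minimal sketch, with illustrative names:

```python
import math

def log_perplexity(test_log_likelihood, test_token_count):
    """Per-token log-perplexity: -log P(W_test | ...) / N_test."""
    return -test_log_likelihood / test_token_count

def perplexity(test_log_likelihood, test_token_count):
    """Conventional perplexity is the exponential of the log-perplexity."""
    return math.exp(log_perplexity(test_log_likelihood, test_token_count))
```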
Since perplexity is a measure of loss between the empirical and variational
distributions, computing it requires inference of the variational
parameters. One option could be to save the document-topic Dirichlet parameters
`gammad` during training
([gensim](https://github.com/piskvorky/gensim/blob/develop/gensim/models/ldamodel.py#L634)
provides this as an optional arg to `bound`), but if we are concerned with
scaling this may not be the best idea (I've left this as a tentative TODO; note
that gensim itself does not use it).
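
For illustration, a minimal sketch of that gensim option, assuming `LdaModel.bound` accepts a precomputed `gamma` as in the linked source; the toy corpus and variable names are mine:

```python
from gensim import corpora, models

# Toy corpus; in practice this would be the training documents.
train_texts = [["apple", "banana", "apple"],
               ["bus", "car", "bus", "train"],
               ["apple", "car", "banana"]]
dictionary = corpora.Dictionary(train_texts)
train_corpus = [dictionary.doc2bow(text) for text in train_texts]

lda = models.LdaModel(train_corpus, id2word=dictionary, num_topics=2)

# The option discussed above: reuse the document-topic variational
# parameters (gamma) from the E-step instead of re-running inference
# inside bound().
gamma, _ = lda.inference(train_corpus)
train_bound = lda.bound(train_corpus, gamma=gamma)

# Per-token log-perplexity on the training corpus.
n_tokens = sum(count for doc in train_corpus for _, count in doc)
print(-train_bound / n_tokens)
```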
On the test set, however, we will have to perform inference before we can
calculate perplexity. This is because the LDA model only learns the
document-topic (`alpha`) and topic-word (`lambda`) Dirichlet parameters, while
computing log-perplexity requires the variational distribution
(parameterized by `gamma`, which is obtained by inference).
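
To make that last point concrete, here is a rough numpy sketch of the variational E-step that would need to run on each held-out document to obtain `gamma` (standard mean-field updates; the initialization, smoothing constant, and convergence criterion are illustrative, not Spark's implementation):

```python
import numpy as np
from scipy.special import digamma

def infer_gamma(word_ids, word_counts, alpha, exp_Elog_beta,
                max_iter=100, tol=1e-3):
    """Fit the per-document Dirichlet parameter gamma by fixed-point iteration.

    word_ids, word_counts: sparse representation of one held-out document.
    alpha: (K,) document-topic Dirichlet prior.
    exp_Elog_beta: (K, V) exp(E[log beta]) derived from the trained lambda.
    """
    K = alpha.shape[0]
    gamma = np.random.gamma(100., 1. / 100., K)   # common initialization
    exp_Elog_beta_d = exp_Elog_beta[:, word_ids]  # (K, |doc|)
    for _ in range(max_iter):
        last_gamma = gamma
        # E[log theta_d] under the current gamma.
        exp_Elog_theta = np.exp(digamma(gamma) - digamma(gamma.sum()))
        # phi_norm[w] = sum_k exp(E[log theta_k]) * exp(E[log beta_kw])
        phi_norm = exp_Elog_theta @ exp_Elog_beta_d + 1e-100
        # gamma_k = alpha_k + sum_w n_dw * phi_dwk
        gamma = alpha + exp_Elog_theta * (exp_Elog_beta_d @ (word_counts / phi_norm))
        if np.mean(np.abs(gamma - last_gamma)) < tol:
            break
    return gamma
```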