GitHub user feynmanliang commented on the pull request:
https://github.com/apache/spark/pull/7705#issuecomment-125679733
@hhbyyh I believe that log-perplexity is still used to evaluate generative NLP
models (e.g. Table 2 of [Wallach 09](http://dirichlet.net/pdf/wallach09rethinking.pdf)
evaluates with log P(W_test | W, Z, …) / N_test). Is your concern about
more efficient ways to compute it?
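
For reference, the quantity in that table is a per-token predictive log-likelihood, and perplexity is just its exponentiated negation; a minimal sketch, with illustrative names:

```python
import math

def log_perplexity(test_log_likelihood, test_token_count):
    """Per-token log-perplexity: -log P(W_test | ...) / N_test."""
    return -test_log_likelihood / test_token_count

def perplexity(test_log_likelihood, test_token_count):
    """Conventional perplexity is the exponential of the log-perplexity."""
    return math.exp(log_perplexity(test_log_likelihood, test_token_count))
```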
Since perplexity is a measure of loss between the empirical and variational
distributions, computing it requires inference of the variational
parameters. One option could be to save the document-topic Dirichlet parameters
`gammad` during training
([gensim](https://github.com/piskvorky/gensim/blob/develop/gensim/models/ldamodel.py#L634)
provides this as an optional arg to `bound`), but if we are concerned with
scaling this may not be the best idea (I've left this as a tentative TODO; note
that gensim itself does not use it).
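
For illustration, a minimal sketch of that gensim option, assuming `LdaModel.bound` accepts a precomputed `gamma` as in the linked source; the toy corpus and variable names are mine:

```python
from gensim import corpora, models

# Toy corpus; in practice this would be the training documents.
train_texts = [["apple", "banana", "apple"],
               ["bus", "car", "bus", "train"],
               ["apple", "car", "banana"]]
dictionary = corpora.Dictionary(train_texts)
train_corpus = [dictionary.doc2bow(text) for text in train_texts]

lda = models.LdaModel(train_corpus, id2word=dictionary, num_topics=2)

# The option discussed above: reuse the document-topic variational
# parameters (gamma) from the E-step instead of re-running inference
# inside bound().
gamma, _ = lda.inference(train_corpus)
train_bound = lda.bound(train_corpus, gamma=gamma)

# Per-token log-perplexity on the training corpus.
n_tokens = sum(count for doc in train_corpus for _, count in doc)
print(-train_bound / n_tokens)
```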
On the test set, however, we will have to perform inference before we can
calculate perplexity. This is because the LDA model only learns the
document-topic (`alpha`) and topic-word (`lambda`) Dirichlet parameters, while
computing log-perplexity requires the variational distribution
(parameterized by `gamma`, which is obtained by inference).
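
To make that last point concrete, here is a rough numpy sketch of the variational E-step that would need to run on each held-out document to obtain `gamma` (standard mean-field updates; the initialization, smoothing constant, and convergence criterion are illustrative, not Spark's implementation):

```python
import numpy as np
from scipy.special import digamma

def infer_gamma(word_ids, word_counts, alpha, exp_Elog_beta,
                max_iter=100, tol=1e-3):
    """Fit the per-document Dirichlet parameter gamma by fixed-point iteration.

    word_ids, word_counts: sparse representation of one held-out document.
    alpha: (K,) document-topic Dirichlet prior.
    exp_Elog_beta: (K, V) exp(E[log beta]) derived from the trained lambda.
    """
    K = alpha.shape[0]
    gamma = np.random.gamma(100., 1. / 100., K)   # common initialization
    exp_Elog_beta_d = exp_Elog_beta[:, word_ids]  # (K, |doc|)
    for _ in range(max_iter):
        last_gamma = gamma
        # E[log theta_d] under the current gamma.
        exp_Elog_theta = np.exp(digamma(gamma) - digamma(gamma.sum()))
        # phi_norm[w] = sum_k exp(E[log theta_k]) * exp(E[log beta_kw])
        phi_norm = exp_Elog_theta @ exp_Elog_beta_d + 1e-100
        # gamma_k = alpha_k + sum_w n_dw * phi_dwk
        gamma = alpha + exp_Elog_theta * (exp_Elog_beta_d @ (word_counts / phi_norm))
        if np.mean(np.abs(gamma - last_gamma)) < tol:
            break
    return gamma
```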