Hi Markus,

I found that the current LDA implementation includes the term E[log p(beta | eta) - log q(beta | lambda)] in the approximate bound function and uses it to calculate perplexity, but this term is not part of the likelihood function in Blei's C implementation. That may account for some of the difference. (I am not sure which one is correct; I will need some time to compare the two.)
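If it helps to compare, here is a rough sketch (mine, not the library's own helper) of that topic-word term, computed from a fitted model's attributes. It assumes `components_` holds the variational parameter lambda and `topic_word_prior_` holds eta, and it mirrors the onlineldavb lines linked below:

```
import numpy as np
from scipy.special import gammaln, psi

def topic_word_term(lda):
    """E[log p(beta | eta) - log q(beta | lambda)] for a fitted
    LatentDirichletAllocation model -- a sketch, not sklearn's own helper."""
    lam = lda.components_            # variational parameter lambda, shape (n_topics, n_words)
    eta = lda.topic_word_prior_      # symmetric topic-word prior
    n_words = lam.shape[1]
    # expectation of log beta under the variational Dirichlet q(beta | lambda)
    e_log_beta = psi(lam) - psi(lam.sum(axis=1, keepdims=True))
    score = np.sum((eta - lam) * e_log_beta)
    score += np.sum(gammaln(lam) - gammaln(eta))
    score += np.sum(gammaln(eta * n_words) - gammaln(lam.sum(axis=1)))
    return score
```

This is only meant to make it easy to see how large that term is next to the document-dependent part of the bound.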
Best,
Chyi-Kwei

Reference code:

sklearn:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/online_lda.py#L707-L709

original onlineldavb:
https://github.com/blei-lab/onlineldavb/blob/master/onlineldavb.py#L384-L388

Blei's C implementation:
https://github.com/blei-lab/lda-c/blob/master/lda-inference.c#L94-L127

On Wed, Oct 4, 2017 at 7:56 AM Markus Konrad <markus.kon...@wzb.eu> wrote:
> Hi there,
>
> I'm trying to find the optimal number of topics for topic modeling with
> Latent Dirichlet Allocation. I implemented a 5-fold cross-validation
> method similar to the one described and implemented in R here [1]. I
> basically split the full data into 5 equal-sized chunks. Then for each
> fold (`cur_fold`), 4 of the 5 chunks are used for training and 1 for
> validation, using the `perplexity()` method on the held-out data set:
>
> ```
> from sklearn.decomposition import LatentDirichletAllocation
>
> # split the document-term matrix into training and validation rows
> dtm_train = data[split_folds != cur_fold, :]
> dtm_valid = data[split_folds == cur_fold, :]
>
> lda_instance = LatentDirichletAllocation(**params)
> lda_instance.fit(dtm_train)
>
> # perplexity on the held-out fold
> perpl = lda_instance.perplexity(dtm_valid)
> ```
>
> This is done for a set of parameters, basically for a varying number of
> topics (n_components).
>
> I tried this out with a number of different data sets, for example with
> the "Associated Press" data mentioned in [1], which is the sample data
> for David M. Blei's LDA C implementation [2]. Using the same data, I
> would expect to get results similar to those in [1], which found that a
> model with ~100 topics fits the AP data best. However, my experiments
> always show that the perplexity grows exponentially with the number of
> topics, so the "best" model is always the one with the lowest number of
> topics. The same happens with other data sets, and similar results occur
> when calculating the perplexity on the full training data alone (i.e. no
> cross-validation on held-out data).
>
> Does anyone have an idea why these results are not consistent with those
> from [1]? Is perplexity() not the right method to use when evaluating
> held-out data? Could it be a problem that some of the columns of the
> training data's term-frequency matrix are all-zero?
>
> Best,
> Markus
>
> [1] http://ellisp.github.io/blog/2017/01/05/topic-model-cv
> [2] https://web.archive.org/web/20160930175144/http://www.cs.princeton.edu/~blei/lda-c/index.html
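For anyone who wants to reproduce the setup quickly, here is a self-contained sketch of the cross-validation loop described in the quoted message. The toy corpus and the grid of topic counts are placeholders standing in for the AP data and the real parameter grid:

```
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold

# tiny toy corpus standing in for the AP data
docs = [
    "stock market prices fell sharply today",
    "the senate passed a new budget bill",
    "the team won the championship game last night",
    "oil prices rose after the latest trade talks",
] * 25  # repeat so every fold still contains documents

dtm = CountVectorizer().fit_transform(docs)   # document-term matrix
kf = KFold(n_splits=5, shuffle=True, random_state=0)

for n_topics in (5, 10, 20):
    fold_perplexities = []
    for train_idx, valid_idx in kf.split(dtm):
        dtm_train, dtm_valid = dtm[train_idx], dtm[valid_idx]
        lda = LatentDirichletAllocation(n_components=n_topics,
                                        learning_method="batch",
                                        random_state=0)
        lda.fit(dtm_train)
        fold_perplexities.append(lda.perplexity(dtm_valid))
    print(n_topics, np.mean(fold_perplexities))
```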
_______________________________________________ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn