Hi there, I'm trying to find the optimal number of topics for topic modeling with Latent Dirichlet Allocation. I implemented a 5-fold cross-validation method similar to the one described and implemented in R here [1]. I basically split the full data into 5 equal-sized chunks. Then, for each fold (`cur_fold`), 4 of the 5 chunks are used for training and 1 for validation, using the `perplexity()` method on the held-out data:
```
from sklearn.decomposition import LatentDirichletAllocation

# data: document-term matrix of raw counts; split_folds: fold index per document
dtm_train = data[split_folds != cur_fold, :]
dtm_valid = data[split_folds == cur_fold, :]

lda_instance = LatentDirichletAllocation(**params)
lda_instance.fit(dtm_train)
perpl = lda_instance.perplexity(dtm_valid)
```

This is done for a set of parameters, basically varying the number of topics (`n_components`).

I tried this with a number of different data sets, for example the "Associated Press" data used in [1], which is the sample data for David M. Blei's LDA-C implementation [2]. Using the same data, I would expect to get results similar to those in [1], which found that a model with ~100 topics fits the AP data best.

However, my experiments always show the perplexity growing exponentially with the number of topics, so the "best" model is always the one with the fewest topics. The same happens with other data sets, and also when I calculate the perplexity on the full training data alone (i.e. without cross-validation on held-out data).

Does anyone have an idea why these results are not consistent with those from [1]? Is `perplexity()` not the correct method for evaluating held-out data? Could it be a problem that some columns of the training term-frequency matrix are all-zero?

Best,
Markus

[1] http://ellisp.github.io/blog/2017/01/05/topic-model-cv
[2] https://web.archive.org/web/20160930175144/http://www.cs.princeton.edu/~blei/lda-c/index.html
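
P.S. In case it helps, here is a minimal, self-contained sketch of the full cross-validation loop I described above. It uses the 20 newsgroups corpus as a stand-in for the AP data (loading Blei's data format is omitted here), and the topic grid and LDA settings are only illustrative:

```
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Stand-in corpus; in my actual runs this is the AP document-term matrix.
docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")).data
data = CountVectorizer(max_df=0.95, min_df=2, stop_words="english").fit_transform(docs)

# Randomly assign each document to one of 5 folds.
n_folds = 5
rng = np.random.RandomState(0)
split_folds = rng.randint(0, n_folds, size=data.shape[0])

for n_topics in (10, 25, 50, 100, 150):
    fold_perplexities = []
    for cur_fold in range(n_folds):
        dtm_train = data[split_folds != cur_fold, :]
        dtm_valid = data[split_folds == cur_fold, :]
        lda_instance = LatentDirichletAllocation(
            n_components=n_topics, learning_method="batch", random_state=1
        )
        lda_instance.fit(dtm_train)
        # Held-out perplexity on the validation fold (lower should mean a better fit).
        fold_perplexities.append(lda_instance.perplexity(dtm_valid))
    print(n_topics, np.mean(fold_perplexities))
```

The structure of the loop is the same as in my actual runs; only the corpus and the parameter grid differ.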