Hi there,

I'm trying out sklearn's latent Dirichlet allocation implementation for topic modeling. The code from the official example [1] works just fine, and the extracted topics look reasonable. However, when I try other corpora, for example the Gutenberg corpus from NLTK, most of the extracted topics are garbage. See this example output when asking for 30 topics:

Topic #0: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
Topic #1: mr (1081.61) emma (866.01) miss (506.94) mrs (445.56) jane (301.83)
Topic #2: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
Topic #3: thee (82.64) thou (70.0) thy (66.66) father (56.45) mother (55.27)
Topic #4: anne (498.74) captain (303.01) lady (173.96) mr (172.07) charles (166.21)
Topic #5: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
Topic #6: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
Topic #7: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
...

Many of the topics end up with identical weights, all equal to the `topic_word_prior` parameter. This is my script:

import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation


def print_top_words(model, feature_names, n_top_words):
    # Print the n_top_words highest-weighted terms of each topic.
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i] + " (" + str(round(topic[i], 2)) + ")"
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)


# One raw document per book in the NLTK Gutenberg corpus.
data_samples = [nltk.corpus.gutenberg.raw(f_id)
                for f_id in nltk.corpus.gutenberg.fileids()]

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
tf = tf_vectorizer.fit_transform(data_samples)

lda = LatentDirichletAllocation(n_components=30,
                                learning_method='batch',
                                n_jobs=-1,  # all CPUs
                                verbose=1,
                                evaluate_every=10,
                                max_iter=1000,
                                doc_topic_prior=0.1,
                                topic_word_prior=0.01,
                                random_state=1)
lda.fit(tf)

tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, 5)
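To confirm that those flat topics really just sit at the prior, I ran this quick check after the fit (a rough snippet; it simply counts the rows of `components_` that never moved away from `topic_word_prior`):

import numpy as np

# A topic whose weights all still (almost) equal topic_word_prior was
# apparently never assigned any words during fitting.
n_flat = sum(np.allclose(topic, lda.topic_word_prior)
             for topic in lda.components_)
print("%d of %d topics stuck at the prior" % (n_flat, len(lda.components_)))

For the run above, this flags most of the 30 topics.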
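For comparison, this is roughly what I ran with the lda package [2] on the same document-term matrix (reconstructed from memory, so treat it as a sketch; if I remember its API correctly, `alpha` and `eta` are that package's names for the document-topic and topic-word priors):

import lda  # collapsed Gibbs sampling implementation from [2]

gibbs_model = lda.LDA(n_topics=30, n_iter=1000,
                      alpha=0.1, eta=0.01, random_state=1)
gibbs_model.fit(tf)  # same document-term matrix as above

# topic_word_ holds one word distribution per topic.
for topic_idx, topic in enumerate(gibbs_model.topic_word_):
    top_words = [tf_feature_names[i] for i in topic.argsort()[:-6:-1]]
    print("Topic #%d: %s" % (topic_idx, " ".join(top_words)))

This produced reasonable topics on the same data.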
Is there a problem with how I set up the LatentDirichletAllocation instance or with how I pass the data? I have tried different parameter settings, but none of them produced good results for this corpus, while alternative implementations (like the lda package [2], sketched above) were able to find reasonable topics.

Best,
Markus

[1] http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py
[2] http://pythonhosted.org/lda/