Hi Markus,

I tried your code, and I think the issue might be that there are only 18 docs in the Gutenberg corpus. If you print out the transformed doc topic distribution, you will see that a lot of topics are not used. And since no words are assigned to those topics, their weights will be equal to the `topic_word_prior` parameter.
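To make that concrete, here is a minimal sketch of how one might flag such unused topics from the topic-word matrix. The helper name `find_unused_topics`, the tolerance `tol`, and the toy matrix are assumptions for illustration only and are not part of scikit-learn; a fitted model may leave tiny residual weight on unused topics, so the tolerance may need adjusting.

```python
import numpy as np


def find_unused_topics(components, topic_word_prior, tol=1e-3):
    """Return indices of topics whose word weights all sit at the prior.

    Hypothetical helper (not a scikit-learn function); `tol` is an
    assumed threshold for "approximately equal to the prior".
    """
    components = np.asarray(components, dtype=float)
    # A topic that received no word assignments keeps every weight at
    # (approximately) the prior, so its whole row matches the prior value.
    at_prior = np.all(np.abs(components - topic_word_prior) <= tol, axis=1)
    return np.flatnonzero(at_prior)


# Toy stand-in for lda.components_ (3 topics x 4 words), with prior 0.01.
toy_components = np.array([
    [0.01, 0.01, 0.01, 0.01],        # unused topic
    [1081.61, 866.01, 0.01, 0.01],   # topic with words assigned
    [0.01, 0.01, 0.01, 0.01],        # unused topic
])
unused = find_unused_topics(toy_components, topic_word_prior=0.01)
print(unused)  # [0 2]
```

With a fitted model, the same check could be run by passing `lda.components_` together with the `topic_word_prior` value given to the constructor (0.01 in your script).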
You can print out the transformed doc topic distributions like this:
-------------
>>> doc_distr = lda.fit_transform(tf)
>>> for d in doc_distr:
...     print(np.where(d > 0.001)[0])
...
[17 27]
[17 27]
[17 27 28]
[14]
[ 2 4 28]
[ 2 4 15 21 27 28]
[1]
[ 1 2 17 21 27 28]
[ 2 15 17 22 28]
[ 2 17 21 22 27 28]
[ 2 15 17 28]
[ 2 17 21 27 28]
[ 2 14 15 17 21 22 27 28]
[15 22]
[ 8 11]
[8]
[ 8 24]
[ 2 14 15 22]

My full test scripts are here:
https://gist.github.com/chyikwei/1707b59e009d381e1ce1e7a38f9c7826

Best,
Chyi-Kwei

On Thu, Sep 14, 2017 at 7:26 AM Markus Konrad <markus.kon...@wzb.eu> wrote:

> Hi there,
>
> I'm trying out sklearn's latent Dirichlet allocation implementation for
> topic modeling. The code from the official example [1] works just fine and
> the extracted topics look reasonable. However, when I try other corpora,
> for example the Gutenberg corpus from NLTK, most of the extracted topics
> are garbage. See this example output, when trying to get 30 topics:
>
> Topic #0: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
> Topic #1: mr (1081.61) emma (866.01) miss (506.94) mrs (445.56) jane (301.83)
> Topic #2: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
> Topic #3: thee (82.64) thou (70.0) thy (66.66) father (56.45) mother (55.27)
> Topic #4: anne (498.74) captain (303.01) lady (173.96) mr (172.07) charles (166.21)
> Topic #5: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
> Topic #6: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
> Topic #7: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
> ...
>
> Many topics tend to have the same weights, all equal to the
> `topic_word_prior` parameter.
>
> This is my script:
>
> import nltk
> from sklearn.feature_extraction.text import CountVectorizer
> from sklearn.decomposition import LatentDirichletAllocation
>
> def print_top_words(model, feature_names, n_top_words):
>     for topic_idx, topic in enumerate(model.components_):
>         message = "Topic #%d: " % topic_idx
>         message += " ".join([feature_names[i] + " (" + str(round(topic[i], 2)) + ")"
>                              for i in topic.argsort()[:-n_top_words - 1:-1]])
>         print(message)
>
>
> data_samples = [nltk.corpus.gutenberg.raw(f_id)
>                 for f_id in nltk.corpus.gutenberg.fileids()]
>
> tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
>                                 stop_words='english')
> tf = tf_vectorizer.fit_transform(data_samples)
>
> lda = LatentDirichletAllocation(n_components=30,
>                                 learning_method='batch',
>                                 n_jobs=-1,  # all CPUs
>                                 verbose=1,
>                                 evaluate_every=10,
>                                 max_iter=1000,
>                                 doc_topic_prior=0.1,
>                                 topic_word_prior=0.01,
>                                 random_state=1)
>
> lda.fit(tf)
> tf_feature_names = tf_vectorizer.get_feature_names()
> print_top_words(lda, tf_feature_names, 5)
>
> Is there a problem in how I set up the LatentDirichletAllocation instance
> or pass the data? I tried out different parameter settings, but none of
> them provided good results for that corpus. I also tried out alternative
> implementations (like the lda package [2]) and those were able to find
> reasonable topics.
>
> Best,
> Markus
>
>
> [1] http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py
> [2] http://pythonhosted.org/lda/
> _______________________________________________
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn