Hi there,
I'm trying out sklearn's latent Dirichlet allocation implementation for topic
modeling. The code from the official example [1] works just fine and the
extracted topics look reasonable. However, when I try other corpora, for
example the Gutenberg corpus from NLTK, most of the extracted topics are
garbage. See this example output when asking for 30 topics:
Topic #0: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
Topic #1: mr (1081.61) emma (866.01) miss (506.94) mrs (445.56) jane (301.83)
Topic #2: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
Topic #3: thee (82.64) thou (70.0) thy (66.66) father (56.45) mother (55.27)
Topic #4: anne (498.74) captain (303.01) lady (173.96) mr (172.07) charles (166.21)
Topic #5: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
Topic #6: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
Topic #7: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
...
Many of the topics end up with identical word weights, all equal to the `topic_word_prior` parameter (0.01 in my setup).
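A quick way to count these degenerate topics (a rough check, assuming the fitted `lda` estimator from the script below):

import numpy as np

# a topic counts as "dead" if every word weight still sits at the prior
n_dead = sum(np.allclose(topic, lda.topic_word_prior)
             for topic in lda.components_)
print("%d of %d topics never moved off the prior" % (n_dead, lda.n_components))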
This is my script:
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        top = topic.argsort()[:-n_top_words - 1:-1]
        message = "Topic #%d: " % topic_idx
        message += " ".join("%s (%s)" % (feature_names[i], round(topic[i], 2))
                            for i in top)
        print(message)

data_samples = [nltk.corpus.gutenberg.raw(f_id)
                for f_id in nltk.corpus.gutenberg.fileids()]

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(data_samples)

lda = LatentDirichletAllocation(n_components=30,
                                learning_method='batch',
                                n_jobs=-1,  # all CPUs
                                verbose=1,
                                evaluate_every=10,
                                max_iter=1000,
                                doc_topic_prior=0.1,
                                topic_word_prior=0.01,
                                random_state=1)
lda.fit(tf)

tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, 5)
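For reference, the document-topic distributions can be inspected the same way, to see how many documents put noticeable mass on each topic (the 0.01 threshold is arbitrary):

doc_topics = lda.transform(tf)  # shape: (n_documents, 30)
print((doc_topics > 0.01).sum(axis=0))  # documents per topic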
Is there a problem with how I set up the LatentDirichletAllocation instance, or with how I pass the data? I have tried various parameter settings, but none of them produced good results for this corpus. Alternative implementations (like the lda package [2]), on the other hand, were able to find reasonable topics.
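For comparison, this is roughly how I ran the lda package on the same document-term matrix (a sketch; `n_iter` is just an illustrative value, and `topic_word_` is that package's counterpart of sklearn's `components_`):

import lda  # the package from [2], a collapsed Gibbs sampler

gibbs_model = lda.LDA(n_topics=30, n_iter=1500, random_state=1)
gibbs_model.fit(tf)  # accepts the same sparse count matrix
for topic_idx, topic in enumerate(gibbs_model.topic_word_):
    top_words = [tf_feature_names[i] for i in topic.argsort()[:-6:-1]]
    print("Topic #%d: %s" % (topic_idx, " ".join(top_words)))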
Best,
Markus
[1] http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py
[2] http://pythonhosted.org/lda/