This is indeed interesting. I didn't know there were such big differences
between these approaches. I split the 18 documents into sub-documents of
5 paragraphs each, which gave me around 10k of these sub-documents. Now
scikit-learn and Gensim deliver much better results, quite similar to
those from a Gibbs sampling based implementation. So it was basically the
same data, just split in a different way.
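For illustration, a minimal sketch of this kind of splitting step (the
helper name and the blank-line paragraph delimiter are assumptions, not
from the original code; the chunk size of 5 matches the experiment above):
-------------
import nltk

def split_into_subdocs(text, n_paragraphs=5):
    # Assumption: paragraphs are delimited by blank lines.
    paragraphs = [p for p in text.split('\n\n') if p.strip()]
    # Join every n_paragraphs consecutive paragraphs into one sub-document.
    return ['\n\n'.join(paragraphs[i:i + n_paragraphs])
            for i in range(0, len(paragraphs), n_paragraphs)]

# One corpus entry per sub-document instead of one entry per book.
data_samples = [sub_doc
                for f_id in nltk.corpus.gutenberg.fileids()
                for sub_doc in split_into_subdocs(nltk.corpus.gutenberg.raw(f_id))]
-------------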
I think the disadvantages/limits of the Variational Bayes approach should
be mentioned in the documentation.

Best,
Markus

On 09/18/2017 06:59 PM, Andreas Mueller wrote:
> For very few documents, Gibbs sampling is likely to work better - or
> rather, Gibbs sampling usually works better given enough runtime, and
> for so few documents, runtime is not an issue.
> The length of the documents doesn't matter, only the size of the
> vocabulary.
> Also, hyperparameter choices might need to be different for Gibbs
> sampling vs. variational inference.
>
> On 09/18/2017 12:26 PM, Markus Konrad wrote:
>> Hi Chyi-Kwei,
>>
>> thanks for digging into this. I made similar observations with Gensim
>> when using only a small number of (big) documents. Gensim also uses the
>> Online Variational Bayes approach (Hoffman et al.). So could it be that
>> the Hoffman et al. method is problematic in such scenarios? I found that
>> Gibbs sampling based implementations provide much more informative
>> topics in this case.
>>
>> If this is the case, then if I slice the documents in some way (say,
>> every N paragraphs become a "document"), I should get better results
>> with scikit-learn and Gensim, right? I think I'll try this out tomorrow.
>>
>> Best,
>> Markus
>>
>>> Date: Sun, 17 Sep 2017 23:52:51 +0000
>>> From: chyi-kwei yau <chyikwei....@gmail.com>
>>> To: Scikit-learn mailing list <scikit-learn@python.org>
>>> Subject: Re: [scikit-learn] LatentDirichletAllocation failing to find
>>>     topics in NLTK Gutenberg corpus?
>>> Message-ID:
>>>     <cak-jh0ygd8fsdjom+gddohvaycpujvhhx77qcd+d4_xm6vi...@mail.gmail.com>
>>> Content-Type: text/plain; charset="utf-8"
>>>
>>> Hi Markus,
>>>
>>> I tried your code, and I think the issue might be that there are only
>>> 18 docs in the Gutenberg corpus.
>>> If you print out the transformed doc-topic distributions, you will see
>>> that a lot of topics are not used. And since there are no words
>>> assigned to those topics, their weights will be equal to the
>>> `topic_word_prior` parameter.
>>>
>>> You can print out the transformed doc-topic distributions like this:
>>> -------------
>>> >>> import numpy as np
>>> >>> doc_distr = lda.fit_transform(tf)
>>> >>> for d in doc_distr:
>>> ...     print(np.where(d > 0.001)[0])
>>> ...
>>> [17 27]
>>> [17 27]
>>> [17 27 28]
>>> [14]
>>> [ 2  4 28]
>>> [ 2  4 15 21 27 28]
>>> [1]
>>> [ 1  2 17 21 27 28]
>>> [ 2 15 17 22 28]
>>> [ 2 17 21 22 27 28]
>>> [ 2 15 17 28]
>>> [ 2 17 21 27 28]
>>> [ 2 14 15 17 21 22 27 28]
>>> [15 22]
>>> [ 8 11]
>>> [8]
>>> [ 8 24]
>>> [ 2 14 15 22]
>>> -------------
>>>
>>> and my full test scripts are here:
>>> https://gist.github.com/chyikwei/1707b59e009d381e1ce1e7a38f9c7826
>>>
>>> Best,
>>> Chyi-Kwei
>>>
>>> On Thu, Sep 14, 2017 at 7:26 AM Markus Konrad <markus.kon...@wzb.eu>
>>> wrote:
>>>
>>>> Hi there,
>>>>
>>>> I'm trying out sklearn's latent Dirichlet allocation implementation
>>>> for topic modeling. The code from the official example [1] works just
>>>> fine and the extracted topics look reasonable. However, when I try
>>>> other corpora, for example the Gutenberg corpus from NLTK, most of the
>>>> extracted topics are garbage.
>>>> See this example output, when trying to get 30 topics:
>>>>
>>>> Topic #0: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
>>>> Topic #1: mr (1081.61) emma (866.01) miss (506.94) mrs (445.56) jane (301.83)
>>>> Topic #2: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
>>>> Topic #3: thee (82.64) thou (70.0) thy (66.66) father (56.45) mother (55.27)
>>>> Topic #4: anne (498.74) captain (303.01) lady (173.96) mr (172.07) charles (166.21)
>>>> Topic #5: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
>>>> Topic #6: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
>>>> Topic #7: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
>>>> ...
>>>>
>>>> Many topics tend to have the same weights, all equal to the
>>>> `topic_word_prior` parameter.
>>>>
>>>> This is my script:
>>>> -------------
>>>> import nltk
>>>> from sklearn.feature_extraction.text import CountVectorizer
>>>> from sklearn.decomposition import LatentDirichletAllocation
>>>>
>>>> def print_top_words(model, feature_names, n_top_words):
>>>>     for topic_idx, topic in enumerate(model.components_):
>>>>         message = "Topic #%d: " % topic_idx
>>>>         message += " ".join([feature_names[i] + " (" + str(round(topic[i], 2)) + ")"
>>>>                              for i in topic.argsort()[:-n_top_words - 1:-1]])
>>>>         print(message)
>>>>
>>>> data_samples = [nltk.corpus.gutenberg.raw(f_id)
>>>>                 for f_id in nltk.corpus.gutenberg.fileids()]
>>>>
>>>> tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
>>>>                                 stop_words='english')
>>>> tf = tf_vectorizer.fit_transform(data_samples)
>>>>
>>>> lda = LatentDirichletAllocation(n_components=30,
>>>>                                 learning_method='batch',
>>>>                                 n_jobs=-1,  # all CPUs
>>>>                                 verbose=1,
>>>>                                 evaluate_every=10,
>>>>                                 max_iter=1000,
>>>>                                 doc_topic_prior=0.1,
>>>>                                 topic_word_prior=0.01,
>>>>                                 random_state=1)
>>>>
>>>> lda.fit(tf)
>>>> tf_feature_names = tf_vectorizer.get_feature_names()
>>>> print_top_words(lda, tf_feature_names, 5)
>>>> -------------
>>>>
>>>> Is there a problem in how I set up the LatentDirichletAllocation
>>>> instance or pass the data? I tried out different parameter settings,
>>>> but none of them provided good results for that corpus. I also tried
>>>> out alternative implementations (like the lda package [2]) and those
>>>> were able to find reasonable topics.
>>>>
>>>> Best,
>>>> Markus
>>>>
>>>> [1] http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py
>>>>
>>>> [2] http://pythonhosted.org/lda/
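For illustration, a minimal sketch of the check Chyi-Kwei describes,
flagging topics whose weights never moved off `topic_word_prior` (it
assumes the fitted `lda` estimator from the script above; the variable
names and the use of `np.allclose` are choices of this sketch, not from
the thread):
-------------
import numpy as np

# Assumes `lda` was fit as in the script above (topic_word_prior=0.01).
# A topic that never had any words assigned keeps every weight in its
# row of components_ at the prior value, so comparing each row against
# the prior flags the degenerate topics.
unused_topics = [k for k, row in enumerate(lda.components_)
                 if np.allclose(row, lda.topic_word_prior)]
print("degenerate topics:", unused_topics)
-------------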