Sorry, I meant of course "the number that *maximized* the log likelihood" in the first sentence...
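(For illustration - this is a sketch, not code from the thread: picking
the number of topics by log likelihood, as discussed below, could be
done with sklearn's `score()`, which returns an approximate variational
log likelihood, higher being better. It assumes the document-term count
matrix `tf` from the script at the bottom of the thread; the candidate
list is arbitrary, and scoring held-out documents would be more
principled than scoring the training matrix.)

-------------
from sklearn.decomposition import LatentDirichletAllocation

results = []
for n in [5, 10, 12, 15, 20, 30]:
    lda = LatentDirichletAllocation(n_components=n,
                                    learning_method='batch',
                                    max_iter=500,
                                    random_state=1)
    lda.fit(tf)
    ll = lda.score(tf)  # approximate log likelihood bound
    results.append((ll, n))
    print("n_components=%d: log likelihood=%.1f" % (n, ll))

print("best:", max(results)[1])
-------------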
On 09/20/2017 09:18 AM, Markus Konrad wrote:
> I tried it with 12 topics (that's the number that minimized the log
> likelihood) and there were also some very general topics. But the
> Gibbs sampling didn't extract "empty topics" (those with all weights
> equal to `topic_word_prior`), unlike sklearn's implementation. This is
> what puzzled me.
>
> It isn't actually "little" data. The documents themselves are quite
> big. But I think that this is where my thinking went wrong initially.
> I thought that if 18 big documents cover a certain set of topics, then
> if I split these documents into more, but smaller, documents, a
> similar set of topics should be discovered. But you're right, the
> latter contains more information. Taken to an extreme: if I had only
> 1 document, it wouldn't be possible to find the topics in there with
> LDA.
>
> Best,
> Markus
>
> On 09/19/2017 06:07 PM, Andreas Mueller wrote:
>> I'm actually surprised the Gibbs sampling gave useful results with
>> so little data.
>> And splitting the documents results in very different data: it has a
>> lot more information.
>> How many topics did you use?
>>
>> Also: PR for docs welcome!
>>
>> On 09/19/2017 04:26 AM, Markus Konrad wrote:
>>> This is indeed interesting. I didn't know that there are such big
>>> differences between these approaches. I split the 18 documents into
>>> sub-documents of 5 paragraphs each, so that I got around 10k of
>>> these sub-documents. Now, scikit-learn and gensim deliver much
>>> better results, quite similar to those from a Gibbs-sampling-based
>>> implementation. So it was basically the same data, just split in a
>>> different way.
>>>
>>> I think the disadvantages/limits of the variational Bayes approach
>>> should be mentioned in the documentation.
>>>
>>> Best,
>>> Markus
>>>
>>> On 09/18/2017 06:59 PM, Andreas Mueller wrote:
>>>> For very few documents, Gibbs sampling is likely to work better -
>>>> or rather, Gibbs sampling usually works better given enough
>>>> runtime, and for so few documents, runtime is not an issue.
>>>> The length of the documents doesn't matter, only the size of the
>>>> vocabulary.
>>>> Also, hyperparameter choices might need to be different for Gibbs
>>>> sampling vs. variational inference.
>>>>
>>>> On 09/18/2017 12:26 PM, Markus Konrad wrote:
>>>>> Hi Chyi-Kwei,
>>>>>
>>>>> thanks for digging into this. I made similar observations with
>>>>> Gensim when using only a small number of (big) documents. Gensim
>>>>> also uses the online variational Bayes approach (Hoffman et al.).
>>>>> So could it be that the Hoffman et al. method is problematic in
>>>>> such scenarios? I found that Gibbs-sampling-based implementations
>>>>> provide much more informative topics in this case.
>>>>>
>>>>> If that were the case, then if I sliced the documents in some way
>>>>> (say, every N paragraphs become a "document"), I should get
>>>>> better results with scikit-learn and Gensim, right? I think I'll
>>>>> try this out tomorrow.
>>>>>
>>>>> Best,
>>>>> Markus
>>>>>
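(The document splitting described above might look roughly like the
following sketch - not Markus's actual code; treating blank lines as
paragraph breaks is an assumed heuristic.)

-------------
import nltk

def split_into_subdocs(text, n_par=5):
    # heuristic: blank lines separate paragraphs
    paragraphs = [p for p in text.split('\n\n') if p.strip()]
    return ['\n\n'.join(paragraphs[i:i + n_par])
            for i in range(0, len(paragraphs), n_par)]

docs = [nltk.corpus.gutenberg.raw(f_id)
        for f_id in nltk.corpus.gutenberg.fileids()]
subdocs = [s for doc in docs for s in split_into_subdocs(doc)]
print("%d documents -> %d sub-documents" % (len(docs), len(subdocs)))
# `subdocs` would then replace the 18 raw texts as input to the
# CountVectorizer in the script at the bottom of the thread
-------------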
>>>>>> Date: Sun, 17 Sep 2017 23:52:51 +0000
>>>>>> From: chyi-kwei yau <chyikwei....@gmail.com>
>>>>>> To: Scikit-learn mailing list <scikit-learn@python.org>
>>>>>> Subject: Re: [scikit-learn] LatentDirichletAllocation failing to
>>>>>> find topics in NLTK Gutenberg corpus?
>>>>>>
>>>>>> Hi Markus,
>>>>>>
>>>>>> I tried your code, and I think the issue is that there are only
>>>>>> 18 docs in the Gutenberg corpus.
>>>>>> If you print out the transformed doc topic distributions, you
>>>>>> will see that a lot of topics are not used. And since no words
>>>>>> are assigned to those topics, their weights will be equal to the
>>>>>> `topic_word_prior` parameter.
>>>>>>
>>>>>> You can print out the transformed doc topic distributions like
>>>>>> this:
>>>>>> -------------
>>>>>> >>> import numpy as np
>>>>>> >>> doc_distr = lda.fit_transform(tf)
>>>>>> >>> for d in doc_distr:
>>>>>> ...     print(np.where(d > 0.001)[0])
>>>>>> ...
>>>>>> [17 27]
>>>>>> [17 27]
>>>>>> [17 27 28]
>>>>>> [14]
>>>>>> [ 2  4 28]
>>>>>> [ 2  4 15 21 27 28]
>>>>>> [1]
>>>>>> [ 1  2 17 21 27 28]
>>>>>> [ 2 15 17 22 28]
>>>>>> [ 2 17 21 22 27 28]
>>>>>> [ 2 15 17 28]
>>>>>> [ 2 17 21 27 28]
>>>>>> [ 2 14 15 17 21 22 27 28]
>>>>>> [15 22]
>>>>>> [ 8 11]
>>>>>> [8]
>>>>>> [ 8 24]
>>>>>> [ 2 14 15 22]
>>>>>> -------------
>>>>>>
>>>>>> My full test scripts are here:
>>>>>> https://gist.github.com/chyikwei/1707b59e009d381e1ce1e7a38f9c7826
>>>>>>
>>>>>> Best,
>>>>>> Chyi-Kwei
>>>>>>
>>>>>> On Thu, Sep 14, 2017 at 7:26 AM Markus Konrad
>>>>>> <markus.kon...@wzb.eu> wrote:
>>>>>>
>>>>>>> Hi there,
>>>>>>>
>>>>>>> I'm trying out sklearn's latent Dirichlet allocation
>>>>>>> implementation for topic modeling. The code from the official
>>>>>>> example [1] works just fine, and the extracted topics look
>>>>>>> reasonable. However, when I try other corpora, for example the
>>>>>>> Gutenberg corpus from NLTK, most of the extracted topics are
>>>>>>> garbage. See this example output when trying to get 30 topics:
>>>>>>>
>>>>>>> Topic #0: zoological (0.01) fathoms (0.01) fatigued (0.01)
>>>>>>>           fatigues (0.01) fatiguing (0.01)
>>>>>>> Topic #1: mr (1081.61) emma (866.01) miss (506.94) mrs (445.56)
>>>>>>>           jane (301.83)
>>>>>>> Topic #2: zoological (0.01) fathoms (0.01) fatigued (0.01)
>>>>>>>           fatigues (0.01) fatiguing (0.01)
>>>>>>> Topic #3: thee (82.64) thou (70.0) thy (66.66) father (56.45)
>>>>>>>           mother (55.27)
>>>>>>> Topic #4: anne (498.74) captain (303.01) lady (173.96)
>>>>>>>           mr (172.07) charles (166.21)
>>>>>>> Topic #5: zoological (0.01) fathoms (0.01) fatigued (0.01)
>>>>>>>           fatigues (0.01) fatiguing (0.01)
>>>>>>> Topic #6: zoological (0.01) fathoms (0.01) fatigued (0.01)
>>>>>>>           fatigues (0.01) fatiguing (0.01)
>>>>>>> Topic #7: zoological (0.01) fathoms (0.01) fatigued (0.01)
>>>>>>>           fatigues (0.01) fatiguing (0.01)
>>>>>>> ...
>>>>>>>
>>>>>>> Many topics tend to have the same weights, all equal to the
>>>>>>> `topic_word_prior` parameter.
>>>>>>>
>>>>>>> This is my script:
>>>>>>>
>>>>>>> import nltk
>>>>>>> from sklearn.feature_extraction.text import CountVectorizer
>>>>>>> from sklearn.decomposition import LatentDirichletAllocation
>>>>>>>
>>>>>>> def print_top_words(model, feature_names, n_top_words):
>>>>>>>     for topic_idx, topic in enumerate(model.components_):
>>>>>>>         message = "Topic #%d: " % topic_idx
>>>>>>>         message += " ".join([feature_names[i] + " ("
>>>>>>>                              + str(round(topic[i], 2)) + ")"
>>>>>>>                              for i in topic.argsort()[:-n_top_words - 1:-1]])
>>>>>>>         print(message)
>>>>>>>
>>>>>>> data_samples = [nltk.corpus.gutenberg.raw(f_id)
>>>>>>>                 for f_id in nltk.corpus.gutenberg.fileids()]
>>>>>>>
>>>>>>> tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
>>>>>>>                                 stop_words='english')
>>>>>>> tf = tf_vectorizer.fit_transform(data_samples)
>>>>>>>
>>>>>>> lda = LatentDirichletAllocation(n_components=30,
>>>>>>>                                 learning_method='batch',
>>>>>>>                                 n_jobs=-1,  # all CPUs
>>>>>>>                                 verbose=1,
>>>>>>>                                 evaluate_every=10,
>>>>>>>                                 max_iter=1000,
>>>>>>>                                 doc_topic_prior=0.1,
>>>>>>>                                 topic_word_prior=0.01,
>>>>>>>                                 random_state=1)
>>>>>>>
>>>>>>> lda.fit(tf)
>>>>>>> tf_feature_names = tf_vectorizer.get_feature_names()
>>>>>>> print_top_words(lda, tf_feature_names, 5)
>>>>>>>
>>>>>>> Is there a problem in how I set up the LatentDirichletAllocation
>>>>>>> instance or pass the data? I tried out different parameter
>>>>>>> settings, but none of them produced good results for that
>>>>>>> corpus. I also tried alternative implementations (like the lda
>>>>>>> package [2]), and those were able to find reasonable topics.
>>>>>>>
>>>>>>> Best,
>>>>>>> Markus
>>>>>>>
>>>>>>> [1]
>>>>>>> http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py
>>>>>>> [2] http://pythonhosted.org/lda/

--
Markus Konrad - DV / Data Science
fon: +49 30 25491 555
fax: +49 30 25491 558
mail: markus.kon...@wzb.eu

WZB Data Science Blog: https://datascience.blog.wzb.eu/

Raum D 005
WZB – Wissenschaftszentrum Berlin für Sozialforschung
Reichpietschufer 50
D-10785 Berlin
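(For comparison, a run of the Gibbs-sampling lda package [2] on the
same data might look roughly like the following sketch. It reuses `tf`
and `tf_feature_names` from the script above; `alpha` and `eta` play
the roles of `doc_topic_prior` and `topic_word_prior`.)

-------------
import numpy as np
import lda  # http://pythonhosted.org/lda/

model = lda.LDA(n_topics=30, n_iter=1500,
                alpha=0.1, eta=0.01,  # analogous to the sklearn priors
                random_state=1)
model.fit(tf)  # the sparse count matrix from CountVectorizer

vocab = np.asarray(tf_feature_names)
for topic_idx, topic_dist in enumerate(model.topic_word_):
    top = topic_dist.argsort()[:-6:-1]  # indices of the 5 top words
    print("Topic #%d: %s" % (topic_idx, " ".join(vocab[top])))
-------------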