Sorry, I meant of course "the number that *maximized* the log
likelihood" in the first sentence...


On 09/20/2017 09:18 AM, Markus Konrad wrote:
> I tried it with 12 topics (that's the number that minimized the log
> likelihood) and there were also some very general topics. But the Gibbs
> sampling didn't extract "empty topics" (those with all weights equal to
> `topic_word_prior`), unlike sklearn's implementation. This is what
> puzzled me.
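> 
> For reference, a minimal sketch of how such empty topics can be spotted
> (it assumes a fitted sklearn `lda` model as in my original script quoted
> below; 0.01 is the `topic_word_prior` value I used there):
> 
> import numpy as np
> 
> # a topic is "empty" when every word weight is still at the prior value,
> # i.e. no words were ever assigned to it during fitting
> empty_topics = [i for i, topic in enumerate(lda.components_)
>                 if np.allclose(topic, 0.01)]  # 0.01 = topic_word_prior
> print("empty topics:", empty_topics)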
> 
> It isn't actually "little" data. The documents themselves are quite big.
> But I think that this is where my thinking went wrong initially. I
> thought that if 18 big documents cover a certain set of topics, then
> splitting them into more, but smaller, documents should lead to a
> similar set of topics being discovered. But you're right, the latter
> contains more information. Taken to the extreme: if I had only one
> document, it wouldn't be possible to find any topics in it with LDA.
> 
> Best,
> Markus
> 
> 
> 
> On 09/19/2017 06:07 PM, Andreas Mueller wrote:
>> I'm actually surprised the Gibbs sampling gave useful results with so
>> little data.
>> And splitting the documents results in very different data. It has a lot
>> more information.
>> How many topics did you use?
>>
>> Also: PR for docs welcome!
>>
>> On 09/19/2017 04:26 AM, Markus Konrad wrote:
>>> This is indeed interesting. I didn't know that there are such big
>>> differences between these approaches. I split the 18 documents into
>>> sub-documents of 5 paragraphs each, so that I got around 10k of these
>>> sub-documents. Now, scikit-learn and gensim deliver much better results,
>>> quite similar to those from a Gibbs-sampling-based implementation. So it
>>> was basically the same data, just split in a different way.
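>>>
>>> The split itself was nothing fancy -- roughly along these lines (a
>>> minimal sketch; it assumes paragraphs are separated by blank lines,
>>> which is only approximately true for the Gutenberg texts):
>>>
>>> import nltk
>>>
>>> chunk_size = 5  # paragraphs per sub-document
>>> data_samples = []
>>> for f_id in nltk.corpus.gutenberg.fileids():
>>>     paragraphs = nltk.corpus.gutenberg.raw(f_id).split('\n\n')
>>>     for i in range(0, len(paragraphs), chunk_size):
>>>         data_samples.append('\n\n'.join(paragraphs[i:i + chunk_size]))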
>>>
>>> I think the disadvantages/limits of the Variational Bayes approach
>>> should be mentioned in the documentation.
>>>
>>> Best,
>>> Markus
>>>
>>>
>>>
>>> On 09/18/2017 06:59 PM, Andreas Mueller wrote:
>>>> For very few documents, Gibbs sampling is likely to work better - or
>>>> rather, Gibbs sampling usually works
>>>> better given enough runtime, and for so few documents, runtime is not an
>>>> issue.
>>>> The length of the documents doesn't matter, only the size of the
>>>> vocabulary.
>>>> Also, hyperparameter choices might need to be different for Gibbs
>>>> sampling vs. variational inference.
>>>>
>>>> On 09/18/2017 12:26 PM, Markus Konrad wrote:
>>>>> Hi Chyi-Kwei,
>>>>>
>>>>> Thanks for digging into this. I made similar observations with Gensim
>>>>> when using only a small number of (big) documents. Gensim also uses the
>>>>> Online Variational Bayes approach (Hoffman et al.). So could it be that
>>>>> the Hoffman et al. method is problematic in such scenarios? I found
>>>>> that Gibbs-sampling-based implementations provide much more
>>>>> informative topics in this case.
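>>>>>
>>>>> For comparison, the Gibbs-based run looked roughly like this (a
>>>>> minimal sketch with the `lda` package; `tf` and `tf_feature_names`
>>>>> are the count matrix and vocabulary from my original script quoted
>>>>> below, alpha/eta mirror the priors used there, and n_iter=1000 is an
>>>>> arbitrary choice):
>>>>>
>>>>> import lda  # Gibbs-sampling-based implementation
>>>>>
>>>>> gibbs = lda.LDA(n_topics=30, n_iter=1000, alpha=0.1, eta=0.01,
>>>>>                 random_state=1)
>>>>> gibbs.fit(tf)  # same document-term count matrix as for sklearn
>>>>>
>>>>> for topic_idx, topic in enumerate(gibbs.topic_word_):
>>>>>     top = topic.argsort()[:-6:-1]  # indices of the 5 largest weights
>>>>>     print("Topic #%d: %s" % (topic_idx,
>>>>>                              " ".join(tf_feature_names[i] for i in top)))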
>>>>>
>>>>> If this were the case, then if I sliced the documents in some way
>>>>> (say, every N paragraphs become a "document"), I should get better
>>>>> results with scikit-learn and Gensim, right? I think I'll try this
>>>>> out tomorrow.
>>>>>
>>>>> Best,
>>>>> Markus
>>>>>
>>>>>
>>>>>
>>>>>> Date: Sun, 17 Sep 2017 23:52:51 +0000
>>>>>> From: chyi-kwei yau <chyikwei....@gmail.com>
>>>>>> To: Scikit-learn mailing list <scikit-learn@python.org>
>>>>>> Subject: Re: [scikit-learn] LatentDirichletAllocation failing to find
>>>>>>      topics in NLTK Gutenberg corpus?
>>>>>>
>>>>>> Hi Markus,
>>>>>>
>>>>>> I tried your code, and I think the issue is that there are only 18
>>>>>> documents in the Gutenberg corpus.
>>>>>> If you print out the transformed doc-topic distributions, you will see
>>>>>> that a lot of topics are not used.
>>>>>> And since no words are assigned to those topics, their weights will be
>>>>>> equal to the `topic_word_prior` parameter.
>>>>>>
>>>>>> You can print out the transformed doc topic distributions like this:
>>>>>> -------------
>>>>>> >>> import numpy as np
>>>>>> >>> doc_distr = lda.fit_transform(tf)
>>>>>> >>> for d in doc_distr:
>>>>>> ...     print(np.where(d > 0.001)[0])
>>>>>> ...
>>>>>> [17 27]
>>>>>> [17 27]
>>>>>> [17 27 28]
>>>>>> [14]
>>>>>> [ 2  4 28]
>>>>>> [ 2  4 15 21 27 28]
>>>>>> [1]
>>>>>> [ 1  2 17 21 27 28]
>>>>>> [ 2 15 17 22 28]
>>>>>> [ 2 17 21 22 27 28]
>>>>>> [ 2 15 17 28]
>>>>>> [ 2 17 21 27 28]
>>>>>> [ 2 14 15 17 21 22 27 28]
>>>>>> [15 22]
>>>>>> [ 8 11]
>>>>>> [8]
>>>>>> [ 8 24]
>>>>>> [ 2 14 15 22]
>>>>>>
>>>>>> and my full test scripts are here:
>>>>>> https://gist.github.com/chyikwei/1707b59e009d381e1ce1e7a38f9c7826
>>>>>>
>>>>>> Best,
>>>>>> Chyi-Kwei
>>>>>>
>>>>>>
>>>>>> On Thu, Sep 14, 2017 at 7:26 AM Markus Konrad <markus.kon...@wzb.eu>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi there,
>>>>>>>
>>>>>>> I'm trying out sklearn's latent Dirichlet allocation
>>>>>>> implementation for
>>>>>>> topic modeling. The code from the official example [1] works just
>>>>>>> fine and
>>>>>>> the extracted topics look reasonable. However, when I try other
>>>>>>> corpora,
>>>>>>> for example the Gutenberg corpus from NLTK, most of the extracted
>>>>>>> topics
>>>>>>> are garbage. See this example output when trying to get 30 topics:
>>>>>>>
>>>>>>> Topic #0: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
>>>>>>> Topic #1: mr (1081.61) emma (866.01) miss (506.94) mrs (445.56) jane (301.83)
>>>>>>> Topic #2: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
>>>>>>> Topic #3: thee (82.64) thou (70.0) thy (66.66) father (56.45) mother (55.27)
>>>>>>> Topic #4: anne (498.74) captain (303.01) lady (173.96) mr (172.07) charles (166.21)
>>>>>>> Topic #5: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
>>>>>>> Topic #6: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
>>>>>>> Topic #7: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
>>>>>>> ...
>>>>>>>
>>>>>>> Many topics tend to have the same weights, all equal to the
>>>>>>> `topic_word_prior` parameter.
>>>>>>>
>>>>>>> This is my script:
>>>>>>>
>>>>>>> import nltk
>>>>>>> from sklearn.feature_extraction.text import CountVectorizer
>>>>>>> from sklearn.decomposition import LatentDirichletAllocation
>>>>>>>
>>>>>>> def print_top_words(model, feature_names, n_top_words):
>>>>>>>     for topic_idx, topic in enumerate(model.components_):
>>>>>>>         message = "Topic #%d: " % topic_idx
>>>>>>>         message += " ".join([feature_names[i] + " (" +
>>>>>>>                              str(round(topic[i], 2)) + ")"
>>>>>>>                              for i in topic.argsort()[:-n_top_words - 1:-1]])
>>>>>>>         print(message)
>>>>>>>
>>>>>>>
>>>>>>> data_samples = [nltk.corpus.gutenberg.raw(f_id)
>>>>>>>                  for f_id in nltk.corpus.gutenberg.fileids()]
>>>>>>>
>>>>>>> tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
>>>>>>>                                   stop_words='english')
>>>>>>> tf = tf_vectorizer.fit_transform(data_samples)
>>>>>>>
>>>>>>> lda = LatentDirichletAllocation(n_components=30,
>>>>>>>                                   learning_method='batch',
>>>>>>>                                   n_jobs=-1,  # all CPUs
>>>>>>>                                   verbose=1,
>>>>>>>                                   evaluate_every=10,
>>>>>>>                                   max_iter=1000,
>>>>>>>                                   doc_topic_prior=0.1,
>>>>>>>                                   topic_word_prior=0.01,
>>>>>>>                                   random_state=1)
>>>>>>>
>>>>>>> lda.fit(tf)
>>>>>>> tf_feature_names = tf_vectorizer.get_feature_names()
>>>>>>> print_top_words(lda, tf_feature_names, 5)
>>>>>>>
>>>>>>> Is there a problem in how I set up the LatentDirichletAllocation
>>>>>>> instance
>>>>>>> or pass the data? I tried out different parameter settings, but
>>>>>>> none of
>>>>>>> them provided good results for that corpus. I also tried out
>>>>>>> alternative
>>>>>>> implementations (like the lda package [2]) and those were able to
>>>>>>> find
>>>>>>> reasonable topics.
>>>>>>>
>>>>>>> Best,
>>>>>>> Markus
>>>>>>>
>>>>>>>
>>>>>>> [1]
>>>>>>> http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py
>>>>>>>
>>>>>>>
>>>>>>> [2] http://pythonhosted.org/lda/

--

Markus Konrad
- DV / Data Science -

fon: +49 30 25491 555
fax: +49 30 25491 558
mail: markus.kon...@wzb.eu

WZB Data Science Blog: https://datascience.blog.wzb.eu/

Raum D 005
WZB – Wissenschaftszentrum Berlin für Sozialforschung
Reichpietschufer 50
D-10785 Berlin
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
