I'm actually surprised the Gibbs sampling gave useful results with so
little data.
And splitting the documents results in very different data: many short
documents carry a lot more information for the model than a few long ones.
How many topics did you use?
Also: PR for docs welcome!
On 09/19/2017 04:26 AM, Markus Konrad wrote:
This is indeed interesting. I didn't know that there were such big
differences between these approaches. I split the 18 documents into
sub-documents of 5 paragraphs each, so that I got around 10k of these
sub-documents. Now, scikit-learn and gensim deliver much better results,
quite similar to those from a Gibbs sampling based implementation. So it
was basically the same data, just split in a different way.
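Roughly, the splitting looked like this (a sketch; the helper name is
made up, and it assumes paragraphs in the raw Gutenberg texts are
separated by blank lines):

import nltk

def split_into_subdocs(raw_text, n_paragraphs=5):
    # Treat blank lines as paragraph boundaries (a rough assumption
    # for the raw Gutenberg texts).
    paragraphs = [p for p in raw_text.split('\n\n') if p.strip()]
    return ['\n\n'.join(paragraphs[i:i + n_paragraphs])
            for i in range(0, len(paragraphs), n_paragraphs)]

data_samples = []
for f_id in nltk.corpus.gutenberg.fileids():
    data_samples.extend(split_into_subdocs(nltk.corpus.gutenberg.raw(f_id)))
# The 18 books become roughly 10k sub-documents; the rest of the
# pipeline (CountVectorizer + LatentDirichletAllocation) is unchanged.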
I think the disadvantages/limits of the Variational Bayes approach
should be mentioned in the documentation.
Best,
Markus
On 09/18/2017 06:59 PM, Andreas Mueller wrote:
For very few documents, Gibbs sampling is likely to work better - or
rather, Gibbs sampling usually works
better given enough runtime, and for so few documents, runtime is not an
issue.
The length of the documents doesn't matter, only the size of the vocabulary.
Also, hyperparameter choices might need to be different for Gibbs
sampling vs. variational inference.
On 09/18/2017 12:26 PM, Markus Konrad wrote:
Hi Chyi-Kwei,
thanks for digging into this. I made similar observations with Gensim
when using only a small number of (big) documents. Gensim also uses the
Online Variational Bayes approach (Hoffman et al.). So could it be that
the Hoffman et al. method is problematic in such scenarios? I found that
Gibbs sampling based implementations provide much more informative
topics in this case.
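For reference, my Gensim runs looked roughly like this (a sketch
reusing `tf` and `tf_feature_names` from my script below; the number of
passes is illustrative):

from gensim import matutils
from gensim.models import LdaModel

# Wrap the scikit-learn document-term matrix (documents as rows) as a
# gensim corpus and map feature indices back to words.
corpus = matutils.Sparse2Corpus(tf, documents_columns=False)
id2word = dict(enumerate(tf_feature_names))
gensim_lda = LdaModel(corpus, num_topics=30, id2word=id2word, passes=10)
for topic in gensim_lda.show_topics(num_topics=5, num_words=5):
    print(topic)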
If that's the case, then slicing the documents in some way (say, every
N paragraphs become a "document") should give better results with
scikit-learn and Gensim, right? I think I'll try this out tomorrow.
Best,
Markus
Date: Sun, 17 Sep 2017 23:52:51 +0000
From: chyi-kwei yau <chyikwei....@gmail.com>
Subject: Re: [scikit-learn] LatentDirichletAllocation failing to find
topics in NLTK Gutenberg corpus?
Hi Markus,
I tried your code, and I think the issue might be that there are only
18 docs in the Gutenberg corpus.
If you print out the transformed doc-topic distributions, you will see
that a lot of topics are not used. And since there are no words assigned
to those topics, their weights will be equal to the `topic_word_prior`
parameter.
You can print them out like this:
-------------
import numpy as np

doc_distr = lda.fit_transform(tf)
for d in doc_distr:
    print(np.where(d > 0.001)[0])
-------------
which gives:
[17 27]
[17 27]
[17 27 28]
[14]
[ 2 4 28]
[ 2 4 15 21 27 28]
[1]
[ 1 2 17 21 27 28]
[ 2 15 17 22 28]
[ 2 17 21 22 27 28]
[ 2 15 17 28]
[ 2 17 21 27 28]
[ 2 14 15 17 21 22 27 28]
[15 22]
[ 8 11]
[8]
[ 8 24]
[ 2 14 15 22]
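You can also flag the unused topics directly from the fitted model. A
minimal sketch (reusing `lda` from your script, where
topic_word_prior=0.01; the 0.02 cutoff is just a heuristic slightly
above the prior):

import numpy as np

# Unused topics keep all of their word weights at (or very near) the
# prior value of 0.01, so their largest weight stays tiny compared to
# the weights of topics that actually absorbed words:
unused = np.where(lda.components_.max(axis=1) < 0.02)[0]
print("unused topics:", unused)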
and my full test scripts are here:
https://gist.github.com/chyikwei/1707b59e009d381e1ce1e7a38f9c7826
Best,
Chyi-Kwei
On Thu, Sep 14, 2017 at 7:26 AM Markus Konrad <markus.kon...@wzb.eu>
wrote:
Hi there,
I'm trying out sklearn's latent Dirichlet allocation implementation for
topic modeling. The code from the official example [1] works just
fine and
the extracted topics look reasonable. However, when I try other
corpora,
for example the Gutenberg corpus from NLTK, most of the extracted
topics
are garbage. See this example output when trying to extract 30 topics:
Topic #0: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
Topic #1: mr (1081.61) emma (866.01) miss (506.94) mrs (445.56) jane (301.83)
Topic #2: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
Topic #3: thee (82.64) thou (70.0) thy (66.66) father (56.45) mother (55.27)
Topic #4: anne (498.74) captain (303.01) lady (173.96) mr (172.07) charles (166.21)
Topic #5: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
Topic #6: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
Topic #7: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
...
Many topics tend to have the same weights, all equal to the
`topic_word_prior` parameter.
This is my script:
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i] + " (" + str(round(topic[i], 2)) + ")"
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)

data_samples = [nltk.corpus.gutenberg.raw(f_id)
                for f_id in nltk.corpus.gutenberg.fileids()]

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
tf = tf_vectorizer.fit_transform(data_samples)

lda = LatentDirichletAllocation(n_components=30,
                                learning_method='batch',
                                n_jobs=-1,  # all CPUs
                                verbose=1,
                                evaluate_every=10,
                                max_iter=1000,
                                doc_topic_prior=0.1,
                                topic_word_prior=0.01,
                                random_state=1)
lda.fit(tf)

tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, 5)
Is there a problem in how I set up the LatentDirichletAllocation
instance
or pass the data? I tried out different parameter settings, but none of
them provided good results for that corpus. I also tried out
alternative
implementations (like the lda package [2]) and those were able to find
reasonable topics.
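The Gibbs sampling run with the lda package looked roughly like this (a
sketch; `tf` and `tf_feature_names` come from the script above, and
alpha/eta mirror doc_topic_prior/topic_word_prior there):

import lda

# Collapsed Gibbs sampling instead of variational Bayes.
gibbs_model = lda.LDA(n_topics=30, n_iter=1000, alpha=0.1, eta=0.01,
                      random_state=1)
gibbs_model.fit(tf)  # accepts the sparse count matrix

for topic_idx, topic in enumerate(gibbs_model.topic_word_):
    top = topic.argsort()[:-6:-1]  # 5 highest-weighted words
    print("Topic #%d: %s" % (topic_idx,
                             " ".join(tf_feature_names[i] for i in top)))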
Best,
Markus
[1]
http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py
[2] http://pythonhosted.org/lda/