For very few documents, Gibbs sampling is likely to work better. More precisely, Gibbs sampling usually works better given enough runtime, and for so few documents, runtime is not an issue.
The length of the documents doesn't matter, only the size of the vocabulary.
Also, hyperparameter choices might need to be different for Gibbs sampling vs. variational inference.
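
For readers unfamiliar with the Gibbs approach, here is a minimal sketch of a collapsed Gibbs sampler for LDA in plain Python. It is only an illustration, not the code used by scikit-learn, Gensim, or the lda package. The function name `gibbs_lda`, its parameters, and the toy corpus are made up for this example; `alpha` and `eta` play the roles of `doc_topic_prior` and `topic_word_prior`.

```python
import random


def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, eta=0.01,
              n_iter=200, seed=1):
    """Collapsed Gibbs sampling for LDA; docs are lists of word ids."""
    rng = random.Random(seed)
    # Count tables: document-topic, topic-word, and per-topic totals.
    ndk = [[0] * n_topics for _ in docs]
    nkw = [[0] * vocab_size for _ in range(n_topics)]
    nk = [0] * n_topics

    # Start from a random topic assignment for every token.
    z = []
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.randrange(n_topics)
            zd.append(k)
            ndk[d][k] += 1
            nkw[k][w] += 1
            nk[k] += 1
        z.append(zd)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                ndk[d][k] -= 1
                nkw[k][w] -= 1
                nk[k] -= 1
                # Unnormalized conditional probability of each topic
                # for this token, given all other assignments.
                weights = [
                    (ndk[d][j] + alpha) * (nkw[j][w] + eta)
                    / (nk[j] + vocab_size * eta)
                    for j in range(n_topics)
                ]
                # Resample the topic and put the token back.
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][w] += 1
                nk[k] += 1
    return ndk, nkw


# Toy corpus (made up): three documents over a five-word vocabulary.
toy_docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4], [0, 1, 3]]
doc_topic_counts, topic_word_counts = gibbs_lda(
    toy_docs, n_topics=2, vocab_size=5, n_iter=50)
print(doc_topic_counts)
print(topic_word_counts)
```

Every token is resampled in every sweep, which is why the number of sweeps (`n_iter`) is the main runtime knob here.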

On 09/18/2017 12:26 PM, Markus Konrad wrote:
Hi Chyi-Kwei,

thanks for digging into this. I made similar observations with Gensim
when using only a small number of (big) documents. Gensim also uses the
Online Variational Bayes approach (Hoffman et al.). So could it be that
the Hoffman et al. method is problematic in such scenarios? I found that
Gibbs sampling based implementations provide much more informative
topics in this case.

If this is the case, then slicing the documents in some way (say,
every N paragraphs become a "document") should give me better results
with scikit-learn and Gensim, right? I think I'll try this out tomorrow.
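
A minimal sketch of such slicing, assuming paragraphs in the raw text are separated by blank lines (the helper name `split_into_chunks` is made up for illustration):

```python
import re


def split_into_chunks(text, n_paragraphs):
    """Split raw text into pseudo-documents of n_paragraphs paragraphs each."""
    # Treat one or more blank lines as a paragraph break.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Group consecutive paragraphs; the last chunk may be shorter.
    return ["\n\n".join(paragraphs[i:i + n_paragraphs])
            for i in range(0, len(paragraphs), n_paragraphs)]


sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
for chunk in split_into_chunks(sample, 2):
    print(repr(chunk))
```

Applying this to each raw text and concatenating the resulting lists would give a corpus of many more, smaller documents to pass to the vectorizer.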

Best,
Markus



Date: Sun, 17 Sep 2017 23:52:51 +0000
From: chyi-kwei yau <chyikwei....@gmail.com>
To: Scikit-learn mailing list <scikit-learn@python.org>
Subject: Re: [scikit-learn] LatentDirichletAllocation failing to find
        topics in NLTK Gutenberg corpus?

Hi Markus,

I tried your code and found that the issue might be that there are only 18 docs
in the Gutenberg corpus.
If you print out the transformed doc-topic distribution, you will see that a lot of
topics are not used.
And since there are no words assigned to those topics, their weights will be
equal to the `topic_word_prior` parameter.

You can print out the transformed doc topic distributions like this:
-------------
>>> import numpy as np
>>> doc_distr = lda.fit_transform(tf)
>>> for d in doc_distr:
...     print(np.where(d > 0.001)[0])
...
[17 27]
[17 27]
[17 27 28]
[14]
[ 2  4 28]
[ 2  4 15 21 27 28]
[1]
[ 1  2 17 21 27 28]
[ 2 15 17 22 28]
[ 2 17 21 22 27 28]
[ 2 15 17 28]
[ 2 17 21 27 28]
[ 2 14 15 17 21 22 27 28]
[15 22]
[ 8 11]
[8]
[ 8 24]
[ 2 14 15 22]

and my full test scripts are here:
https://gist.github.com/chyikwei/1707b59e009d381e1ce1e7a38f9c7826
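
To condense the per-document output above into a single list of topics that no document uses, something like the following sketch could be used. The helper name `unused_topics` is made up for illustration, and the threshold of 0.001 simply mirrors the one used above; it accepts any 2-D sequence, including the array returned by `fit_transform`.

```python
def unused_topics(doc_topic, threshold=0.001):
    """Return indices of topics whose weight never exceeds threshold."""
    n_topics = len(doc_topic[0])
    # Collect every topic that is above the threshold in at least one document.
    used = {k for row in doc_topic for k, p in enumerate(row) if p > threshold}
    return [k for k in range(n_topics) if k not in used]


# Toy doc-topic matrix (made up): topic 2 is never used.
toy = [[0.9, 0.1, 0.0],
       [0.5, 0.4995, 0.0005]]
print(unused_topics(toy))
```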

Best,
Chyi-Kwei


On Thu, Sep 14, 2017 at 7:26 AM Markus Konrad <markus.kon...@wzb.eu> wrote:

Hi there,

I'm trying out sklearn's latent Dirichlet allocation implementation for
topic modeling. The code from the official example [1] works just fine and
the extracted topics look reasonable. However, when I try other corpora,
for example the Gutenberg corpus from NLTK, most of the extracted topics
are garbage. See this example output, when trying to get 30 topics:

Topic #0: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
Topic #1: mr (1081.61) emma (866.01) miss (506.94) mrs (445.56) jane (301.83)
Topic #2: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
Topic #3: thee (82.64) thou (70.0) thy (66.66) father (56.45) mother (55.27)
Topic #4: anne (498.74) captain (303.01) lady (173.96) mr (172.07) charles (166.21)
Topic #5: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
Topic #6: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
Topic #7: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
...

Many topics tend to have the same weights, all equal to the
`topic_word_prior` parameter.

This is my script:

import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def print_top_words(model, feature_names, n_top_words):
    # Print the n_top_words highest-weighted words (with weights) per topic.
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i] + " (" + str(round(topic[i], 2)) + ")"
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)


data_samples = [nltk.corpus.gutenberg.raw(f_id)
                for f_id in nltk.corpus.gutenberg.fileids()]

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                 stop_words='english')
tf = tf_vectorizer.fit_transform(data_samples)

lda = LatentDirichletAllocation(n_components=30,
                                 learning_method='batch',
                                 n_jobs=-1,  # all CPUs
                                 verbose=1,
                                 evaluate_every=10,
                                 max_iter=1000,
                                 doc_topic_prior=0.1,
                                 topic_word_prior=0.01,
                                 random_state=1)

lda.fit(tf)
# Note: on scikit-learn >= 1.2 this method is get_feature_names_out()
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, 5)

Is there a problem in how I set up the LatentDirichletAllocation instance
or pass the data? I tried out different parameter settings, but none of
them provided good results for that corpus. I also tried out alternative
implementations (like the lda package [2]) and those were able to find
reasonable topics.

Best,
Markus


[1] http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py
[2] http://pythonhosted.org/lda/
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
