I usually use an absolute threshold for min_df and a relative one for max_df. I find it very useful to look at the histogram of word dfs when choosing the latter, since it varies a lot from dataset to dataset. For short texts, like tweets, words such as "the" can have a df as low as 0.1.
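To make the absolute-versus-relative distinction concrete, here is a rough sketch on a made-up toy corpus (the `docs` list and variable names are only for illustration, they are not from the thread). An integer threshold counts documents, while a float threshold is a fraction of the corpus, so with about 3000 documents min_df=0.05 means a word has to occur in at least 0.05 * 3000 = 150 documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus of 6 short documents (illustration only).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
    "the bird sang",
    "the fish swam",
    "the cat slept",
]

# Absolute min_df (int): keep words that occur in at least 2 documents.
# Relative max_df (float): drop words that occur in more than 90% of documents.
vec_abs = CountVectorizer(min_df=2, max_df=0.9)
vec_abs.fit(docs)
vocab_abs = sorted(vec_abs.vocabulary_)

# Relative min_df (float): keep words that occur in at least 50% of documents,
# i.e. at least 0.5 * 6 = 3 documents here. This prunes far more aggressively.
vec_rel = CountVectorizer(min_df=0.5, max_df=0.9)
vec_rel.fit(docs)
vocab_rel = sorted(vec_rel.vocabulary_)

print("min_df=2   ->", vocab_abs)
print("min_df=0.5 ->", vocab_rel)
```

On a real corpus the same comparison (vocabulary size with an integer min_df versus a float one) should show fairly quickly whether the relative threshold is throwing away most of the vocabulary.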
It's very easy to look at dfs: just get a transformed X out of your vectorizer and do:

>>> import numpy as np
>>> df = (X > 0).sum(axis=0)  # number of documents containing each word
>>> df = np.asarray(df).ravel().astype(np.double)
>>> df /= X.shape[0]  # convert counts to relative document frequencies

My 2c,
Vlad

On Tue, Feb 9, 2016 at 3:05 AM, Manjush Vundemodalu <manjus...@gmail.com> wrote:
> I think you have most of the words filtered out of tf because of the condition
> min_df=0.05.
>
> I faced similar problems while working with chat data. I tried min_df=2
> instead of a float value and it worked.
>
> Regards,
> Manjush
>
> On Wed, Jan 27, 2016 at 4:31 AM, Rockenkamm, Christian
> <c.rockenk...@stud.uni-goettingen.de> wrote:
>>
>> I used several datasets, with between 2200 and 3500 distinct words in the tf
>> for training the LDA. These data are preprocessed with lemmatization before
>> CountVectorizer.
>> ________________________________
>> From: Joel Nothman [joel.noth...@gmail.com]
>> Sent: Tuesday, 26 January 2016 23:35
>> To: scikit-learn-general
>> Subject: Re: [Scikit-learn-general] Latent Dirichlet Allocation
>>
>> How many distinct words are in your dataset?
>>
>> On 27 January 2016 at 00:21, Rockenkamm, Christian
>> <c.rockenk...@stud.uni-goettingen.de> wrote:
>>>
>>> Hello,
>>>
>>> I have a question concerning the Latent Dirichlet Allocation. The results I
>>> get from using it are a bit confusing.
>>>
>>> At first I use about 3000 documents. In the preparation with the
>>> CountVectorizer I use the following parameters: max_df=0.95 and
>>> min_df=0.05.
>>>
>>> For the LDA fit I use the batch learning method. For the other parameters
>>> I have tried many different values. However, regardless of which
>>> configuration I used, I face one common problem. I get topics that are never
>>> used in any of the docs, and said topics all show the same structure
>>> (topic-word distribution). I even tried gensim with the same configuration
>>> as scikit, yet I still encountered this problem. I also tried lowering the
>>> number of topics in the model, but this did not lead to the expected results
>>> either. For 100 topics, 20-27 were still affected by this problem; for 50
>>> topics, 2-8 of them were still affected, depending on the
>>> parameter setting.
>>>
>>> Does anybody have an idea as to what might be causing this problem and
>>> how to resolve it?
>>>
>>> Best regards,
>>>
>>> Christian Rockenkamm
>>>
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> Scikit-learn-general@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general