I think most of your words are being filtered out of the tf matrix by the
condition min_df=0.05.
I faced similar problems while working with chat data; I tried min_df=2
instead of a float value, and it worked.
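
To illustrate the difference, here is a rough sketch (not code from this thread). As far as I know, CountVectorizer reads a float min_df as a proportion of documents and an integer as an absolute document count, so with about 3000 documents min_df=0.05 asks for a word to appear in at least 0.05 * 3000 = 150 documents. The snippet below imitates that rule in plain Python; the helper names and the toy corpus are made up for the example.

```python
from collections import Counter


def document_frequencies(docs):
    """Count, for each word, the number of documents it appears in."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df


def surviving_vocabulary(docs, min_df=1, max_df=1.0):
    """Keep words whose document frequency lies within [min_df, max_df].

    Assumption: a float threshold is a proportion of documents and an int
    is an absolute document count, as CountVectorizer documents it.
    """
    n_docs = len(docs)
    low = min_df * n_docs if isinstance(min_df, float) else min_df
    high = max_df * n_docs if isinstance(max_df, float) else max_df
    df = document_frequencies(docs)
    return sorted(word for word, count in df.items() if low <= count <= high)


# Hypothetical toy corpus of 200 short "messages":
#   "hello"    appears in every document
#   "commonK"  appears in 20 documents each
#   "wordK"    appears in 4 documents each
#   "rareI"    appears in exactly 1 document each
docs = [f"hello common{i % 10} word{i % 50} rare{i}" for i in range(200)]

strict = surviving_vocabulary(docs, min_df=0.05, max_df=0.95)  # needs >= 10 docs
loose = surviving_vocabulary(docs, min_df=2, max_df=0.95)      # needs >= 2 docs

print(len(strict), len(loose))
```

In this toy case the float threshold keeps only the "common" words, while min_df=2 also keeps the words that occur in a handful of documents, which are often the ones that separate topics.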
Regards,
Manjush
On Wed, Jan 27, 2016 at 4:31 AM, Rockenkamm, Christian <
c.rockenk...@stud.uni-goettingen.de> wrote:
> I used several datasets, with between 2200 and 3500 distinct words in the tf
> matrix used for training the LDA. The data are preprocessed with
> lemmatization before CountVectorizer.
> ------------------------------
> *From:* Joel Nothman [joel.noth...@gmail.com]
> *Sent:* Tuesday, 26 January 2016 23:35
> *To:* scikit-learn-general
> *Subject:* Re: [Scikit-learn-general] Latent Dirichlet Allocation
>
> How many distinct words are in your dataset?
>
> On 27 January 2016 at 00:21, Rockenkamm, Christian <
> c.rockenk...@stud.uni-goettingen.de> wrote:
>
>> Hello,
>>
>>
>> I have a question concerning Latent Dirichlet Allocation. The results I
>> get from using it are a bit confusing.
>>
>> At first I use about 3000 documents. In the preparation with
>> CountVectorizer I use the following parameters: max_df=0.95 and
>> min_df=0.05.
>>
>> For the LDA fit I use the batch learning method. For the other parameters
>> I have tried many different values. However, regardless of which
>> configuration I use, I face one common problem: I get topics that are
>> never used in any of the docs, and these topics all show the same structure
>> (topic-word distribution). I even tried gensim with the same configuration
>> as scikit-learn, yet I still encountered this problem. I also tried
>> lowering the number of topics in the model, but this did not lead to the
>> expected results either. With 100 topics, 20-27 were still affected by
>> this problem; with 50 topics, 2-8 were still affected, depending on
>> the parameter setting.
>>
>> Does anybody have an idea as to what might be causing this problem and
>> how to resolve it?
>>
>>
>> Best regards,
>>
>> Christian Rockenkamm
>>
>>
>> ------------------------------------------------------------------------------
>> Site24x7 APM Insight: Get Deep Visibility into Application Performance
>> APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
>> Monitor end-to-end web transactions and take corrective actions now
>> Troubleshoot faster and improve end-user experience. Signup Now!
>> http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>>
>
>