Hi,

Thank you for your reply. I changed my delimiters from tab to space, and most of 
the problem has been solved (1900 index terms out of 1914). However, a few words 
are still excluded. I didn't set any parameters, as you can see in 
the code:

tf = TfidfVectorizer(ngram_range=(1,ngram),use_idf=False)
tf_matrix = tf.fit_transform(corpus)
feature_names = tf.get_feature_names()

Should I play with the min_df and max_df?
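
One possible cause other than min_df and max_df, offered as an assumption rather than a confirmed diagnosis: with no parameters set, the defaults min_df=1 and max_df=1.0 should not drop anything, but the default analyzer lowercases the text and keeps only tokens matching the documented default token_pattern, (?u)\b\w\w+\b, so single-character tokens disappear and case variants merge. Below is a minimal stdlib-only sketch that mimics that default for unigrams (it does not call scikit-learn, and the toy corpus is hypothetical) and lists the whitespace-separated words that would not appear as features:

```python
import re

# Default token_pattern documented for TfidfVectorizer / CountVectorizer:
# runs of two or more word characters; punctuation acts as a separator.
DEFAULT_TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def whitespace_vocab(docs):
    """Vocabulary obtained by splitting each document on whitespace only."""
    return {tok for doc in docs for tok in doc.split()}


def default_analyzer_vocab(docs):
    """Unigram vocabulary under the assumed defaults (lowercase=True, default token_pattern)."""
    return {tok for doc in docs for tok in DEFAULT_TOKEN_PATTERN.findall(doc.lower())}


# Hypothetical toy corpus, used only for illustration.
corpus = ["I bought a U.S. visa", "Visa rules: x y z"]

expected = whitespace_vocab(corpus)
found = default_analyzer_vocab(corpus)

# Words counted by a plain whitespace split that are absent from the features.
missing = sorted(expected - found)
print(len(expected), len(found))
print(missing)
```

If the missing words in the real corpus turn out to be single characters or differ only in case, passing token_pattern=r"(?u)\b\w+\b" and lowercase=False to TfidfVectorizer may be worth trying before touching min_df and max_df.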

Best,
Ehsan

> On Nov 19, 2015, at 9:01 AM, Chris Holdgraf <choldg...@berkeley.edu> wrote:
> 
> If your vocab is indeed being cut down, could it be because some words don't 
> pass through the word frequency cutoff filters? (min_df, max_df)
> 
>> On Thu, Nov 19, 2015 at 8:55 AM, Andreas Mueller <t3k...@gmail.com> wrote:
>> Hi Ehsan.
>> Which version of scikit-learn are you using?
>> And why do you think the vocabulary size is 1860?
>> What is len(tf.vocabulary_)?
>> 
>> Andy
>> 
>>> On 11/18/2015 11:45 PM, Ehsan Asgari wrote:
>>> Hi,
>>> 
>>> I am using TfidfVectorizer of sklearn.feature_extraction.text for 
>>> generating tf-idf matrix of a corpus. However, when I look at the features 
>>> extracted from my corpus it seems that it has reduced my vocabulary size 
>>> from 1860 to 598! I tried to play with max_df, min_df, and max_features. 
>>> But nothing changed.
>>> 
>>> tf = TfidfVectorizer(ngram_range=(1,ngram),use_idf=False)
>>> tf_matrix =  tf.fit_transform(corpus)
>>> feature_names = tf.get_feature_names()
>>> Does someone have an idea how to solve this problem?
>>> 
>>> Thank you,
>>> 
>>> Ehsan
>>> 
>>> ------------------------------------------------------------------------------
>>> 
>>> 
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> Scikit-learn-general@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> 
> 
> 
> -- 
> _____________________________________
> 
> PhD Candidate in Neuroscience | UC Berkeley
> Editor and Web Director | Berkeley Science Review
> _____________________________________