If your vocab is indeed being cut down, could it be because some words don't
pass the word-frequency cutoff filters (min_df, max_df)?
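
Here is a minimal sketch of how I would check that. The toy corpus and the
names tf_default and tf_pruned are made up for illustration (this is not
your data), and min_df=2 is just an example value. It compares
len(vocabulary_) with the default cutoffs against a vectorizer that prunes
rare terms:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, made up for illustration only.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat and a dog",
]

# Defaults are min_df=1 and max_df=1.0, so no term is pruned by frequency.
tf_default = TfidfVectorizer(use_idf=False)
tf_default.fit(corpus)

# min_df=2 drops every term that occurs in fewer than 2 documents.
tf_pruned = TfidfVectorizer(use_idf=False, min_df=2)
tf_pruned.fit(corpus)

# Compare the learned vocabularies.
print(len(tf_default.vocabulary_), sorted(tf_default.vocabulary_))
print(len(tf_pruned.vocabulary_), sorted(tf_pruned.vocabulary_))
```

Note that even with the default cutoffs the token "a" never makes it into
the vocabulary: the default token_pattern only keeps tokens of two or more
characters, and lowercase=True merges case variants. If your count of 1860
comes from a different tokenization, that alone could explain part of the
gap.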

On Thu, Nov 19, 2015 at 8:55 AM, Andreas Mueller <t3k...@gmail.com> wrote:

> Hi Ehsan.
> Which version of scikit-learn are you using?
> And why do you think the vocabulary size is 1860?
> What is len(tf.vocabulary_)?
>
> Andy
>
> On 11/18/2015 11:45 PM, Ehsan Asgari wrote:
>
> Hi,
>
> I am using TfidfVectorizer of sklearn.feature_extraction.text for
> generating tf-idf matrix of a corpus. However, when I look at the features
> extracted from my corpus it seems that it has reduced my vocabulary size
> from 1860 to 598! I tried to play with max_df, min_df, and max_features.
> But nothing changed.
>
> tf = TfidfVectorizer(ngram_range=(1,ngram),use_idf=False)
> tf_matrix =  tf.fit_transform(corpus)
> feature_names = tf.get_feature_names()
>
> Does someone have an idea how to solve this problem?
>
> Thank you,
>
> Ehsan
>
> ------------------------------------------------------------------------------
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


-- 
_____________________________________

PhD Candidate in Neuroscience | UC Berkeley <http://hwni.org/>
Editor and Web Director | Berkeley Science Review
<http://sciencereview.berkeley.edu/>
_____________________________________
