If your vocabulary is indeed being cut down, could it be because some words don't pass the word-frequency cutoff filters (min_df, max_df)?
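To make the question concrete, here is a rough stdlib-only sketch of how a document-frequency cutoff can shrink a vocabulary. It is not scikit-learn's code: the helper names (doc_frequencies, filter_vocab) and the toy corpus are made up for illustration, and the logic only approximates the documented behaviour (an integer min_df/max_df is an absolute document count, a float is a proportion of documents). It also uses the default token pattern, which only keeps tokens of two or more word characters, so single-character tokens never reach the vocabulary at all.

```python
import re
from collections import Counter

# Default token_pattern documented for TfidfVectorizer:
# tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def doc_frequencies(corpus):
    """Count in how many documents each lowercased token appears."""
    df = Counter()
    for doc in corpus:
        # set() so a token counts once per document
        df.update(set(TOKEN_PATTERN.findall(doc.lower())))
    return df


def filter_vocab(df, n_docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df].

    Integers are absolute document counts, floats are proportions of n_docs.
    """
    lo = min_df if isinstance(min_df, int) else min_df * n_docs
    hi = max_df if isinstance(max_df, int) else max_df * n_docs
    return {term for term, count in df.items() if lo <= count <= hi}


# Toy corpus (hypothetical); the last document has only 1-character tokens.
corpus = ["the cat sat", "the dog sat", "the bird flew", "a b c"]
df = doc_frequencies(corpus)

full = filter_vocab(df, len(corpus))                          # default cutoffs
trimmed = filter_vocab(df, len(corpus), min_df=2, max_df=0.5)  # stricter cutoffs

print(len(full), sorted(full))
print(len(trimmed), sorted(trimmed))
```

With the default cutoffs nothing is filtered by frequency, so if len(tf.vocabulary_) is already smaller than expected with default settings, the tokenizer (token pattern, lowercasing) may be a more likely place to look than min_df/max_df.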
On Thu, Nov 19, 2015 at 8:55 AM, Andreas Mueller <t3k...@gmail.com> wrote:

> Hi Ehsan.
> Which version of scikit-learn are you using?
> And why do you think the vocabulary size is 1860?
> What is len(tf.vocabulary_)?
>
> Andy
>
> On 11/18/2015 11:45 PM, Ehsan Asgari wrote:
>
> > Hi,
> >
> > I am using TfidfVectorizer of sklearn.feature_extraction.text for
> > generating tf-idf matrix of a corpus. However, when I look at the features
> > extracted from my corpus it seems that it has reduced my vocabulary size
> > from 1860 to 598! I tried to play with max_df, min_df, and max_features.
> > But nothing changed.
> >
> > tf = TfidfVectorizer(ngram_range=(1,ngram),use_idf=False)
> > tf_matrix = tf.fit_transform(corpus)
> > feature_names = tf.get_feature_names()
> >
> > Does someone have an idea how to solve this problem?
> >
> > Thank you,
> >
> > Ehsan

--
_____________________________________
PhD Candidate in Neuroscience | UC Berkeley <http://hwni.org/>
Editor and Web Director | Berkeley Science Review <http://sciencereview.berkeley.edu/>
_____________________________________
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general