Hi,

Thank you, but it didn't work. I checked len(tf.vocabulary_) and it is also 1900 instead of 1914. I have another program that counts distinct terms, and it reports 1914.

Best,
Ehsan

On Thu, Nov 19, 2015 at 9:36 AM, Andreas Mueller <t3k...@gmail.com> wrote:

> You should set min_df=1 and max_df=1.0 (which should be the default, but
> it depends on your scikit-learn version).
> How did you determine that your vocabulary size should be 1860?
>
> On 11/19/2015 12:31 PM, Ehsan Asgari wrote:
>
> Hi,
>
> Thank you for your reply. I changed my delimiters from tab to space and
> most of the problem has been solved (1900 index terms out of 1914). However,
> there are still a few words that are excluded. I didn't set any parameters, as
> you can see in the code.
>
>> tf = TfidfVectorizer(ngram_range=(1,ngram),use_idf=False)
>> tf_matrix = tf.fit_transform(corpus)
>> feature_names = tf.get_feature_names()
>>
>> Should I play with the min_df and max_df?
>
> Best,
> Ehsan
>
> On Nov 19, 2015, at 9:01 AM, Chris Holdgraf <choldg...@berkeley.edu> wrote:
>
> If your vocab is indeed being cut down, could it be because some words
> don't pass through the word frequency cutoff filters? (min_df, max_df)
>
> On Thu, Nov 19, 2015 at 8:55 AM, Andreas Mueller <t3k...@gmail.com> wrote:
>
>> Hi Ehsan.
>> Which version of scikit-learn are you using?
>> And why do you think the vocabulary size is 1860?
>> What is len(tf.vocabulary_)?
>>
>> Andy
>>
>> On 11/18/2015 11:45 PM, Ehsan Asgari wrote:
>>
>> Hi,
>>
>> I am using TfidfVectorizer of sklearn.feature_extraction.text for
>> generating tf-idf matrix of a corpus. However, when I look at the features
>> extracted from my corpus it seems that it has reduced my vocabulary size
>> from 1860 to 598! I tried to play with max_df, min_df, and max_features.
>> But nothing changed.
>>
>> tf = TfidfVectorizer(ngram_range=(1,ngram),use_idf=False)
>> tf_matrix = tf.fit_transform(corpus)
>> feature_names = tf.get_feature_names()
>>
>> Does someone have an idea how to solve this problem?
>>
>> Thank you,
>>
>> Ehsan
>
> --
> _____________________________________
>
> PhD Candidate in Neuroscience | UC Berkeley <http://hwni.org/>
> Editor and Web Director | Berkeley Science Review
> <http://sciencereview.berkeley.edu/>
> _____________________________________

------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
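
One possible explanation for the remaining gap (1900 vs. 1914), which the thread does not confirm: with default settings the scikit-learn text vectorizers lowercase the input and tokenize with the pattern (?u)\b\w\w+\b, so single-character terms are dropped, punctuation splits terms, and terms differing only in case are merged. The sketch below emulates those two defaults with the standard re module only (it does not call scikit-learn) and lists which whitespace-delimited terms would not survive. The corpus and the helper names are hypothetical and chosen purely for illustration.

```python
import re

# Default token_pattern of scikit-learn's text vectorizers: a token needs at
# least two word characters, and punctuation acts as a separator.
DEFAULT_TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def whitespace_vocabulary(corpus):
    """Distinct terms as a plain whitespace split would count them."""
    return {term for doc in corpus for term in doc.split()}


def default_analyzer_vocabulary(corpus):
    """Distinct unigrams under the default lowercase + token_pattern rules."""
    return {
        token
        for doc in corpus
        for token in DEFAULT_TOKEN_PATTERN.findall(doc.lower())
    }


# Hypothetical toy corpus, not the data discussed in the thread.
corpus = ["a b-cell binds IL2", "the T cell binds il2 x"]

expected = whitespace_vocabulary(corpus)
extracted = default_analyzer_vocabulary(corpus)

# Terms a whitespace-based counter sees but the default analyzer does not.
missing = sorted(expected - extracted)
print(missing)  # ['IL2', 'T', 'a', 'b-cell', 'x']
```

With scikit-learn installed, the same comparison can be run against the real tokenizer by calling the function returned by tf.build_analyzer() on each document. If this turns out to be the cause, passing token_pattern=r"(?u)\S+" together with lowercase=False is one way to keep whitespace-delimited terms unchanged; whether that is appropriate depends on the corpus.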