http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
Check "token_pattern" in the signature.

On 19 November 2015 at 12:14, Ehsan Asgari <asg...@berkeley.edu> wrote:

> Oh ok, thank you. How can I check the regex of sklearn and modify it?
>
> On Thu, Nov 19, 2015 at 11:40 AM, Andreas Mueller <t3k...@gmail.com>
> wrote:
>
>> Yeah, but if the regexp is different you will get different results.
>>
>> On 11/19/2015 02:20 PM, Ehsan Asgari wrote:
>>
>> No, but actually there is no punctuation in my text, only spaces
>> between terms.
>>
>> Best,
>> Ehsan
>>
>> On Thu, Nov 19, 2015 at 11:13 AM, Fred Mailhot <fred.mail...@gmail.com>
>> wrote:
>>
>>> Have you checked that your other program tokenizes the same way as the
>>> default sklearn tokenization?
>>>
>>> On 19 November 2015 at 11:09, Ehsan Asgari <asg...@berkeley.edu> wrote:
>>>
>>>> Hi,
>>>>
>>>> Thank you, but it didn't work.
>>>> I checked len(tf.vocabulary_) and it is also 1900 instead of 1914.
>>>> I have another program that counts distinct terms and it is 1914 there.
>>>>
>>>> Best,
>>>> Ehsan
>>>>
>>>> On Thu, Nov 19, 2015 at 9:36 AM, Andreas Mueller <t3k...@gmail.com>
>>>> wrote:
>>>>
>>>>> You should set min_df=1 and max_df=1.0 (which should be the default,
>>>>> but it depends on your scikit-learn version).
>>>>> How did you determine that your vocabulary size should be 1860?
>>>>>
>>>>> On 11/19/2015 12:31 PM, Ehsan Asgari wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Thank you for your reply. I changed my delimiters from tab to space
>>>>> and most of the problem has been solved (1900 index terms out of 1914).
>>>>> However, there are still a few words that are excluded. I didn't set
>>>>> any parameter, as you can see in the code.
>>>>>
>>>>>> tf = TfidfVectorizer(ngram_range=(1,ngram),use_idf=False)
>>>>>> tf_matrix = tf.fit_transform(corpus)
>>>>>> feature_names = tf.get_feature_names()
>>>>>
>>>>> Should I play with the min_df and max_df?
>>>>>
>>>>> Best,
>>>>> Ehsan
>>>>>
>>>>> On Nov 19, 2015, at 9:01 AM, Chris Holdgraf <choldg...@berkeley.edu>
>>>>> wrote:
>>>>>
>>>>> If your vocab is indeed being cut down, could it be because some words
>>>>> don't pass through the word frequency cutoff filters? (min_df, max_df)
>>>>>
>>>>> On Thu, Nov 19, 2015 at 8:55 AM, Andreas Mueller <t3k...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Ehsan.
>>>>>> Which version of scikit-learn are you using?
>>>>>> And why do you think the vocabulary size is 1860?
>>>>>> What is len(tf.vocabulary_)?
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>> On 11/18/2015 11:45 PM, Ehsan Asgari wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am using TfidfVectorizer of sklearn.feature_extraction.text for
>>>>>> generating the tf-idf matrix of a corpus. However, when I look at the
>>>>>> features extracted from my corpus, it seems that it has reduced my
>>>>>> vocabulary size from 1860 to 598! I tried to play with max_df,
>>>>>> min_df, and max_features, but nothing changed.
>>>>>>
>>>>>> tf = TfidfVectorizer(ngram_range=(1,ngram),use_idf=False)
>>>>>> tf_matrix = tf.fit_transform(corpus)
>>>>>> feature_names = tf.get_feature_names()
>>>>>>
>>>>>> Does someone have an idea how to solve this problem?
>>>>>>
>>>>>> Thank you,
>>>>>>
>>>>>> Ehsan
>>>>>
>>>>> --
>>>>> _____________________________________
>>>>>
>>>>> PhD Candidate in Neuroscience | UC Berkeley <http://hwni.org/>
>>>>> Editor and Web Director | Berkeley Science Review
>>>>> <http://sciencereview.berkeley.edu/>
>>>>> _____________________________________
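
For anyone comparing counts as in this thread, here is a minimal sketch of why a whitespace-based term count can come out larger than the vectorizer's vocabulary. It applies the default token_pattern shown in the TfidfVectorizer signature, r"(?u)\b\w\w+\b", directly with Python's re module rather than calling scikit-learn. The toy corpus, the vocabulary() helper, and the alternative pattern r"(?u)\S+" are assumptions made up for illustration, not anything taken from Ehsan's data.

```python
import re

# Default token_pattern as listed in the TfidfVectorizer signature:
# a token is a run of two or more word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# Hypothetical alternative: every run of non-whitespace characters is a token.
WHITESPACE_TOKEN_PATTERN = r"(?u)\S+"

# Toy corpus (made up); terms are separated by single spaces only.
corpus = ["a bb c-d ee", "bb x ee f_g"]


def vocabulary(docs, pattern):
    """Collect the distinct tokens that `pattern` finds in `docs`."""
    regex = re.compile(pattern)
    return {token for doc in docs for token in regex.findall(doc)}


# What a separate "count distinct terms" program would likely see.
split_vocab = {term for doc in corpus for term in doc.split()}

default_vocab = vocabulary(corpus, DEFAULT_TOKEN_PATTERN)
whitespace_vocab = vocabulary(corpus, WHITESPACE_TOKEN_PATTERN)

# Terms that plain whitespace splitting keeps but the default pattern drops:
# single-character terms and terms broken up by non-word characters.
print(sorted(split_vocab - default_vocab))   # ['a', 'c-d', 'x']
print(whitespace_vocab == split_vocab)       # True
```

If the missing terms turn out to be of this kind, passing something like token_pattern=r"(?u)\S+" to TfidfVectorizer should bring the two counts closer. Note that lowercase=True is also a default, so terms that differ only in case are merged unless lowercase=False is set.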
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general