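http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

Check the "token_pattern" parameter in the signature. It is the regexp the
default tokenizer uses to split text, and you can pass your own pattern to
change it.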
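
To make that concrete, here is a minimal sketch using only Python's re
module. It assumes the documented default token_pattern, r"(?u)\b\w\w+\b",
which keeps only tokens of two or more word characters, and it lowercases
the text first, as the vectorizer does by default. The toy corpus, the
helper distinct_terms, and the whitespace-only pattern are made up for
illustration and are not part of scikit-learn.

```python
import re

# Default token_pattern documented for TfidfVectorizer:
# a token is a run of two or more word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

# Hypothetical alternative for text that is split on spaces only:
# a token is any run of non-whitespace characters.
WHITESPACE_PATTERN = r"(?u)\S+"


def distinct_terms(docs, pattern):
    """Return the set of distinct lowercased tokens that `pattern` finds in `docs`."""
    regex = re.compile(pattern)
    terms = set()
    for doc in docs:
        terms.update(regex.findall(doc.lower()))
    return terms


# Toy corpus (an assumption, not the corpus from this thread).
corpus = ["a bb ccc", "d bb e-f"]

default_terms = distinct_terms(corpus, DEFAULT_PATTERN)
whitespace_terms = distinct_terms(corpus, WHITESPACE_PATTERN)

# Terms a whitespace-based counter sees but the default pattern drops.
print(sorted(whitespace_terms - default_terms))
```

If the terms printed for your own corpus are the ones missing from
tf.vocabulary_, passing a different pattern, for example
TfidfVectorizer(token_pattern=r"(?u)\S+"), may bring the two counts into
agreement.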

On 19 November 2015 at 12:14, Ehsan Asgari <asg...@berkeley.edu> wrote:

> Oh, OK, thank you. How can I check sklearn's regex and modify it?
>
>
>
>
> On Thu, Nov 19, 2015 at 11:40 AM, Andreas Mueller <t3k...@gmail.com>
> wrote:
>
>> Yeah but if the regexp is different you will get different results.
>>
>>
>>
>> On 11/19/2015 02:20 PM, Ehsan Asgari wrote:
>>
>> No, but actually there is no punctuation in my text, only space between
>> terms.
>>
>> Best,
>> Ehsan
>>
>>
>> On Thu, Nov 19, 2015 at 11:13 AM, Fred Mailhot <fred.mail...@gmail.com>
>> wrote:
>>
>>> Have you checked that your other program tokenizes the same way as the
>>> default sklearn tokenization?
>>>
>>>
>>> On 19 November 2015 at 11:09, Ehsan Asgari <asg...@berkeley.edu> wrote:
>>>
>>>> Hi,
>>>>
>>>> Thank you, but it didn't work.
>>>> I checked len(tf.vocabulary_) and it is also 1900 instead of 1914.
>>>> I have another program that counts distinct terms and it is 1914 there.
>>>>
>>>> Best,
>>>> Ehsan
>>>>
>>>>
>>>>
>>>> On Thu, Nov 19, 2015 at 9:36 AM, Andreas Mueller <t3k...@gmail.com>
>>>> wrote:
>>>>
>>>>> You should set min_df=1 and max_df=1.0 (which should be the default,
>>>>> but it depends on your scikit-learn version).
>>>>> How did you determine that your vocabulary size should be 1860?
>>>>>
>>>>>
>>>>>
>>>>> On 11/19/2015 12:31 PM, Ehsan Asgari wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Thank you for your reply. I changed my delimiters from tab to space
>>>>> and most of the problem has been solved (1900 index terms out of 1914).
>>>>> However, there are still a few words that are excluded. I didn't set any
>>>>> parameters, as you can see in the code.
>>>>>
>>>>> tf = TfidfVectorizer(ngram_range=(1, ngram), use_idf=False)
>>>>> tf_matrix = tf.fit_transform(corpus)
>>>>> feature_names = tf.get_feature_names()
>>>>>
>>>>> Should I play with the min_df and max_df?
>>>>>
>>>>> Best,
>>>>> Ehsan
>>>>>
>>>>> On Nov 19, 2015, at 9:01 AM, Chris Holdgraf <choldg...@berkeley.edu>
>>>>> wrote:
>>>>>
>>>>> If your vocab is indeed being cut down, could it be because some words
>>>>> don't pass through the document frequency cutoff filters (min_df, max_df)?
>>>>>
>>>>> On Thu, Nov 19, 2015 at 8:55 AM, Andreas Mueller <t3k...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Ehsan.
>>>>>> Which version of scikit-learn are you using?
>>>>>> And why do you think the vocabulary size is 1860?
>>>>>> What is len(tf.vocabulary_)?
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>> On 11/18/2015 11:45 PM, Ehsan Asgari wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am using TfidfVectorizer from sklearn.feature_extraction.text to
>>>>>> generate the tf-idf matrix of a corpus. However, when I look at the
>>>>>> features extracted from my corpus, it seems that my vocabulary size has
>>>>>> been reduced from 1860 to 598! I tried playing with max_df, min_df, and
>>>>>> max_features, but nothing changed.
>>>>>>
>>>>>> tf = TfidfVectorizer(ngram_range=(1, ngram), use_idf=False)
>>>>>> tf_matrix = tf.fit_transform(corpus)
>>>>>> feature_names = tf.get_feature_names()
>>>>>>
>>>>>> Does someone have an idea how to solve this problem?
>>>>>>
>>>>>> Thank you,
>>>>>>
>>>>>> Ehsan
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> _____________________________________
>>>>>
>>>>> PhD Candidate in Neuroscience | UC Berkeley <http://hwni.org/>
>>>>> Editor and Web Director | Berkeley Science Review
>>>>> <http://sciencereview.berkeley.edu/>
>>>>> _____________________________________
>>>>>
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general