Re: [Scikit-learn-general] [TfidfVectorizer problem]

Ehsan Asgari Thu, 19 Nov 2015 11:22:57 -0800

No, I haven't, but there is actually no punctuation in my text, only spaces
between terms.
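
One thing that may still be worth checking, even without punctuation: as far as I understand, the default token_pattern of the vectorizer only keeps tokens of two or more word characters, and lowercase=True merges terms that differ only in case. The sketch below emulates that behaviour with the standard re module (it is not scikit-learn's own code) and lists the terms a plain whitespace count would see but the default-style tokenization would not. The corpus and the helper names are made up for illustration.

```python
import re

# Assumption: this regex mirrors scikit-learn's documented default
# token_pattern, which keeps only tokens of two or more word characters.
DEFAULT_TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def whitespace_vocabulary(corpus):
    """Distinct terms obtained by splitting each document on whitespace."""
    return {term for doc in corpus for term in doc.split()}


def default_style_vocabulary(corpus):
    """Distinct terms after lowercasing and applying the default-style regex."""
    return {
        token
        for doc in corpus
        for token in DEFAULT_TOKEN_PATTERN.findall(doc.lower())
    }


# Hypothetical toy corpus (space-separated, no punctuation), not the real data.
corpus = ["a b12 Cat", "cat dog I"]

plain = whitespace_vocabulary(corpus)
emulated = default_style_vocabulary(corpus)

# Terms the whitespace count sees but the regex-based count does not.
missing = sorted(plain - emulated)
print(len(plain), len(emulated), missing)
```

If the missing terms in the real corpus turn out to be single-character or mixed-case terms, passing token_pattern=r"(?u)\S+" and lowercase=False to TfidfVectorizer should, I believe, make it split on whitespace only.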


Best,
Ehsan


On Thu, Nov 19, 2015 at 11:13 AM, Fred Mailhot <[email protected]>
wrote:

> Have you checked that your other program tokenizes the same way as the
> default sklearn tokenization?
>
>
> On 19 November 2015 at 11:09, Ehsan Asgari <[email protected]> wrote:
>
>> Hi,
>>
>> Thank you, but it didn't work.
>> I checked len(tf.vocabulary_) and it is also 1900 instead of 1914.
>> I have another program that counts distinct terms, and it reports 1914.
>>
>> Best,
>> Ehsan
>>
>>
>>
>> On Thu, Nov 19, 2015 at 9:36 AM, Andreas Mueller <[email protected]>
>> wrote:
>>
>>> You should set min_df=1 and max_df=1.0 (which should be the default, but
>>> it depends on your scikit-learn version).
>>> How did you determine that your vocabulary size should be 1860?
>>>
>>>
>>>
>>> On 11/19/2015 12:31 PM, Ehsan Asgari wrote:
>>>
>>> Hi,
>>>
>>> Thank you for your reply. I changed my delimiters from tab to space, and
>>> most of the problem has been solved (1900 index terms out of 1914). However,
>>> a few words are still excluded. I didn't set any parameters, as
>>> you can see in the code:
>>>
>>> tf = TfidfVectorizer(ngram_range=(1,ngram),use_idf=False)
>>> tf_matrix = tf.fit_transform(corpus)
>>> feature_names = tf.get_feature_names()
>>>
>>> Should I play with min_df and max_df?
>>>
>>> Best,
>>> Ehsan
>>>
>>> On Nov 19, 2015, at 9:01 AM, Chris Holdgraf <[email protected]> wrote:
>>>
>>> If your vocab is indeed being cut down, could it be because some words
>>> don't pass the document-frequency cutoff filters (min_df, max_df)?
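
To make the cutoff idea concrete, here is a rough pure-Python sketch of how document-frequency cutoffs of this kind prune a vocabulary. It is an emulation based on my reading of the documented meaning of min_df and max_df (an int is an absolute document count, a float is a proportion of documents), not scikit-learn's implementation, and the corpus and function names are hypothetical.

```python
from collections import Counter


def document_frequencies(tokenized_docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))
    return df


def apply_df_cutoffs(df, n_docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df].

    Assumption: an int is an absolute document count and a float is a
    proportion of n_docs, following the documented meaning of the parameters.
    """
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return {term for term, count in df.items() if low <= count <= high}


# Hypothetical toy corpus, already split on spaces.
docs = [d.split() for d in ["red green", "red blue", "red green blue gold"]]
df = document_frequencies(docs)

kept_default = apply_df_cutoffs(df, len(docs))  # min_df=1, max_df=1.0
kept_strict = apply_df_cutoffs(df, len(docs), min_df=2, max_df=0.9)

print(sorted(kept_default))
print(sorted(kept_strict))
```

With min_df=1 and max_df=1.0 nothing is pruned in this sketch, so if those are the effective values, terms that go missing are more likely lost at the tokenization step than at the frequency cutoffs.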
>>>
>>> On Thu, Nov 19, 2015 at 8:55 AM, Andreas Mueller <[email protected]> wrote:
>>>
>>>> Hi Ehsan.
>>>> Which version of scikit-learn are you using?
>>>> And why do you think the vocabulary size is 1860?
>>>> What is len(tf.vocabulary_)?
>>>>
>>>> Andy
>>>>
>>>> On 11/18/2015 11:45 PM, Ehsan Asgari wrote:
>>>>
>>>> Hi,
>>>>
>>>> I am using TfidfVectorizer from sklearn.feature_extraction.text to
>>>> generate the tf-idf matrix of a corpus. However, when I look at the features
>>>> extracted from my corpus, it seems to have reduced my vocabulary size
>>>> from 1860 to 598! I tried playing with max_df, min_df, and max_features,
>>>> but nothing changed.
>>>>
>>>> tf = TfidfVectorizer(ngram_range=(1,ngram),use_idf=False)
>>>> tf_matrix = tf.fit_transform(corpus)
>>>> feature_names = tf.get_feature_names()
>>>>
>>>> Does someone have an idea how to solve this problem?
>>>>
>>>> Thank you,
>>>>
>>>> Ehsan
>>>>
>>>>
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Scikit-learn-general mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>
>>>
>>> --
>>> _____________________________________
>>>
>>> PhD Candidate in Neuroscience | UC Berkeley <http://hwni.org/>
>>> Editor and Web Director | Berkeley Science Review
>>> <http://sciencereview.berkeley.edu/>
>>> _____________________________________

