Re: [Scikit-learn-general] [TfidfVectorizer problem]

Andreas Mueller Thu, 19 Nov 2015 09:38:32 -0800

You should set min_df=1 and max_df=1.0 (which should be the default, but it depends on your scikit-learn version).

How did you determine that your vocabulary size should be 1860?
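One way to check where the expected and actual counts diverge is to compare a naive whitespace split against the fitted vocabulary. The sketch below is only an illustration: the two-document `corpus` is made up, and the whitespace split is an assumption about how the expected size was counted. It relies on the documented defaults of TfidfVectorizer, which lowercases the text and keeps only tokens of two or more word characters, so single-character and hyphenated terms may not survive as written.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not the poster's data.
corpus = ["gene A binds protein_x", "gene B binds protein-y"]

# Assumed "expected" vocabulary: every distinct whitespace-separated token.
expected = {tok for doc in corpus for tok in doc.split()}

# Disable the document-frequency cutoffs explicitly, as suggested above.
tf = TfidfVectorizer(use_idf=False, min_df=1, max_df=1.0)
tf.fit(corpus)
actual = set(tf.vocabulary_)

# Terms counted by hand that the vectorizer did not keep verbatim.
missing = sorted(expected - actual)
print(len(expected), len(actual))
print(missing)
```

If `missing` contains mostly one-letter tokens or terms with punctuation, the difference probably comes from tokenization rather than from min_df or max_df.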



On 11/19/2015 12:31 PM, Ehsan Asgari wrote:

Hi,

Thank you for your reply. I changed my delimiters from tab to space and most of the problem has been solved (1900 index terms out of 1914). However, there are still a few words that are excluded. I didn't set any parameters, as you can see in the code.

    tf = TfidfVectorizer(ngram_range=(1, ngram), use_idf=False)
    tf_matrix = tf.fit_transform(corpus)
    feature_names = tf.get_feature_names()

Should I play with the min_df and max_df?

Best,
Ehsan

On Nov 19, 2015, at 9:01 AM, Chris Holdgraf <choldg...@berkeley.edu> wrote:

If your vocab is indeed being cut down, could it be because some words don't pass through the word frequency cutoff filters (min_df, max_df)?
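To see how those cutoffs prune terms, here is a minimal sketch on a made-up three-document corpus (the data and the specific min_df=2 / max_df=0.9 values are illustrative assumptions, not anything from the original code). It fits one vectorizer with the cutoffs disabled and one with them enabled, then lists what was dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "apple" is in every document,
# "cherry" and "date" each appear in only one.
corpus = ["apple banana", "apple cherry", "apple banana date"]

# No document-frequency filtering: every tokenized term is kept.
keep_all = TfidfVectorizer(use_idf=False, min_df=1, max_df=1.0).fit(corpus)

# min_df=2 drops terms seen in fewer than 2 documents;
# max_df=0.9 drops terms seen in more than 90% of documents.
pruned = TfidfVectorizer(use_idf=False, min_df=2, max_df=0.9).fit(corpus)

dropped = set(keep_all.vocabulary_) - set(pruned.vocabulary_)
print(sorted(pruned.vocabulary_))
print(sorted(dropped))
```

If the vocabulary is the same size with and without the cutoffs, the filters are not what is removing the terms.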

On Thu, Nov 19, 2015 at 8:55 AM, Andreas Mueller <t3k...@gmail.com> wrote:


    Hi Ehsan.
    Which version of scikit-learn are you using?
    And why do you think the vocabulary size is 1860?
    What is len(tf.vocabulary_)?

    Andy

    On 11/18/2015 11:45 PM, Ehsan Asgari wrote:


    Hi,

    I am using TfidfVectorizer from sklearn.feature_extraction.text
    to generate the tf-idf matrix of a corpus. However, when I look
    at the features extracted from my corpus, it seems that it has
    reduced my vocabulary size from 1860 to 598! I tried to play
    with max_df, min_df, and max_features, but nothing changed.

    tf = TfidfVectorizer(ngram_range=(1, ngram), use_idf=False)
    tf_matrix = tf.fit_transform(corpus)
    feature_names = tf.get_feature_names()

    Does someone have an idea how to solve this problem?

    Thank you,

    Ehsan




    
------------------------------------------------------------------------------


    _______________________________________________
    Scikit-learn-general mailing list
    Scikit-learn-general@lists.sourceforge.net
    <mailto:Scikit-learn-general@lists.sourceforge.net>
    https://lists.sourceforge.net/lists/listinfo/scikit-learn-general



    




--
_____________________________________

PhD Candidate in Neuroscience | UC Berkeley <http://hwni.org/>

Editor and Web Director | Berkeley Science Review<http://sciencereview.berkeley.edu/>

_____________________________________
