Yeah, but if the tokenization regexp is different, you will get different results.
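
To illustrate the point about the regexp, here is a minimal sketch that uses only Python's `re` module. It compares a plain whitespace split (an assumption about what the other counting program does) with the pattern that scikit-learn documents as the default `token_pattern`, which keeps only tokens of two or more word characters. The lowercasing step mirrors the default `lowercase=True`. The helper names and the toy corpus are hypothetical, not taken from the original code or data.

```python
import re

# Pattern documented as the default token_pattern of scikit-learn's text
# vectorizers: a token is a run of two or more word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def whitespace_terms(docs):
    """Distinct terms when splitting on whitespace only.

    Assumption: this is how the other counting program tokenizes.
    """
    return {term for doc in docs for term in doc.split()}


def regex_terms(docs, pattern=DEFAULT_TOKEN_PATTERN, lowercase=True):
    """Distinct terms when optionally lowercasing and matching a token regexp."""
    compiled = re.compile(pattern)
    terms = set()
    for doc in docs:
        text = doc.lower() if lowercase else doc
        terms.update(compiled.findall(text))
    return terms


# Hypothetical toy corpus, not the poster's data.
corpus = ["a b12 cd", "cd x-y Ef", "ef a"]

by_space = whitespace_terms(corpus)
by_regex = regex_terms(corpus)

# Terms counted by the whitespace split that the regexp-based vocabulary lacks.
missing = sorted(by_space - by_regex)
print(len(by_space), len(by_regex), missing)
```

Single-character terms, terms containing non-word characters such as hyphens, and terms that differ only in case are the usual suspects when a few entries go missing. If whitespace splitting is the intended behaviour, passing a matching `token_pattern` (and `lowercase=False` if case matters) to `TfidfVectorizer` is one way to try to bring the two counts in line.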


On 11/19/2015 02:20 PM, Ehsan Asgari wrote:
No, but actually there is no punctuation in my text, only spaces between terms.

Best,
Ehsan


On Thu, Nov 19, 2015 at 11:13 AM, Fred Mailhot <fred.mail...@gmail.com> wrote:

    Have you checked that your other program tokenizes the same way as
    the default sklearn tokenization?


    On 19 November 2015 at 11:09, Ehsan Asgari <asg...@berkeley.edu> wrote:

        Hi,

        Thank you, but it didn't work.
        I checked len(tf.vocabulary_) and it is also 1900 instead of
        1914.
        I have another program that counts distinct terms, and it
        reports 1914.

        Best,
        Ehsan



        On Thu, Nov 19, 2015 at 9:36 AM, Andreas Mueller
        <t3k...@gmail.com> wrote:

            You should set min_df=1 and max_df=1.0 (which should be
            the default, but it depends on your scikit-learn version).
            How did you determine that your vocabulary size should be
            1860?
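
As a rough illustration of why min_df=1 and max_df=1.0 keep every term, here is a small standard-library emulation of document-frequency filtering. It is a sketch of the documented semantics (integers are absolute document counts, floats are proportions of the corpus), not scikit-learn's own code. The function name and the toy documents are hypothetical.

```python
from collections import Counter


def filter_by_document_frequency(tokenized_docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df].

    Integers are treated as absolute document counts and floats as
    proportions of the corpus, following the documented meaning of
    min_df and max_df. This is an emulation, not the library's code.
    """
    n_docs = len(tokenized_docs)
    # Count each term once per document it appears in.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return {term for term, count in df.items() if low <= count <= high}


# Hypothetical, already tokenized toy documents.
docs = [["red", "green"], ["red", "blue"], ["red", "green", "teal"]]

everything = filter_by_document_frequency(docs)             # min_df=1, max_df=1.0
no_rare = filter_by_document_frequency(docs, min_df=2)      # drops blue, teal
no_common = filter_by_document_frequency(docs, max_df=0.9)  # drops red
print(sorted(everything), sorted(no_rare), sorted(no_common))
```

With min_df=1 and max_df=1.0 nothing is filtered by document frequency, so any remaining gap in the vocabulary size has to come from somewhere else, such as tokenization.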



            On 11/19/2015 12:31 PM, Ehsan Asgari wrote:
            Hi,

            Thank you for your reply. I changed my delimiters from
            tab to space and most of the problem has been solved
            (1900 index terms out of 1914). However, there are still a
            few words that are excluded. I didn't set any parameters, as
            you can see in the code.

                tf = TfidfVectorizer(ngram_range=(1, ngram), use_idf=False)
                tf_matrix = tf.fit_transform(corpus)
                feature_names = tf.get_feature_names()

            Should I play with the min_df and max_df?

            Best,
            Ehsan

            On Nov 19, 2015, at 9:01 AM, Chris Holdgraf
            <choldg...@berkeley.edu> wrote:

            If your vocab is indeed being cut down, could it be
            because some words don't pass the document-frequency
            cutoff filters (min_df, max_df)?

            On Thu, Nov 19, 2015 at 8:55 AM, Andreas Mueller
            <t3k...@gmail.com> wrote:

                Hi Ehsan.
                Which version of scikit-learn are you using?
                And why do you think the vocabulary size is 1860?
                What is len(tf.vocabulary_)?

                Andy

                On 11/18/2015 11:45 PM, Ehsan Asgari wrote:

                Hi,

                I am using TfidfVectorizer from
                sklearn.feature_extraction.text to generate the
                tf-idf matrix of a corpus. However, when I look at
                the features extracted from my corpus, it seems that
                my vocabulary size has been reduced from 1860 to 598!
                I tried to play with max_df, min_df, and
                max_features, but nothing changed.

                tf = TfidfVectorizer(ngram_range=(1, ngram), use_idf=False)
                tf_matrix = tf.fit_transform(corpus)
                feature_names = tf.get_feature_names()

                Does someone have an idea how to solve this problem?

                Thank you,

                Ehsan




                
------------------------------------------------------------------------------


                _______________________________________________
                Scikit-learn-general mailing list
                Scikit-learn-general@lists.sourceforge.net
                <mailto:Scikit-learn-general@lists.sourceforge.net>
                
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


                




-- _____________________________________

            PhD Candidate in Neuroscience | UC Berkeley
            <http://hwni.org/>
            Editor and Web Director | Berkeley Science Review
            <http://sciencereview.berkeley.edu/>
            _____________________________________
            
