Yeah, but if the regexp is different, you will get different results.
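To illustrate the point about the regexp, here is a minimal sketch on a made-up two-document corpus (not Ehsan's data). It assumes the default token_pattern of TfidfVectorizer, r"(?u)\b\w\w+\b", which only keeps tokens of two or more word characters, and compares it with a pattern that keeps every run of non-whitespace characters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; terms are separated by single spaces only.
corpus = ["a b cat dog", "cat x yz"]

# Default token_pattern: single-character terms are silently dropped.
default_tf = TfidfVectorizer(use_idf=False).fit(corpus)

# \S+ treats every run of non-whitespace characters as one token,
# which mimics a plain split on spaces (lowercasing still applies).
space_tf = TfidfVectorizer(use_idf=False, token_pattern=r"(?u)\S+").fit(corpus)

default_vocab = sorted(default_tf.vocabulary_)
space_vocab = sorted(space_tf.vocabulary_)
print(default_vocab)
print(space_vocab)
```

If the other program splits on spaces only, a pattern along these lines may bring the two counts closer together, though case handling could still differ.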
On 11/19/2015 02:20 PM, Ehsan Asgari wrote:
No, but there is actually no punctuation in my text, only spaces
between terms.
Best,
Ehsan
On Thu, Nov 19, 2015 at 11:13 AM, Fred Mailhot <fred.mail...@gmail.com> wrote:
Have you checked that your other program tokenizes the same way as
the default sklearn tokenization?
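One way to run this check is sketched below on a made-up corpus (the real corpus is not shown in the thread). It counts distinct terms once with a plain whitespace split, which is only an assumption about what the other program does, and once with the vectorizer's own analyzer, then lists the terms that appear only in the first set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; terms are separated by single spaces only.
corpus = ["a b cat dog", "Cat x yz"]

# Distinct terms as a plain whitespace split would count them.
whitespace_terms = {term for doc in corpus for term in doc.split()}

# Distinct terms as the vectorizer's default analyzer produces them
# (lowercased, tokens of at least two word characters).
analyzer = TfidfVectorizer(use_idf=False).build_analyzer()
sklearn_terms = {term for doc in corpus for term in analyzer(doc)}

# Terms the whitespace count includes but scikit-learn never emits.
missing = sorted(whitespace_terms - sklearn_terms)
print(len(whitespace_terms), len(sklearn_terms), missing)
```

Here the difference comes from single-character terms and from case: "Cat" and "cat" are two terms for the whitespace split but one for the default analyzer.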
On 19 November 2015 at 11:09, Ehsan Asgari <asg...@berkeley.edu> wrote:
Hi,
Thank you, but it didn't work.
I checked len(tf.vocabulary_) and it is also 1900 instead of
1914.
I have another program that counts distinct terms and it is
1914 there.
Best,
Ehsan
On Thu, Nov 19, 2015 at 9:36 AM, Andreas Mueller <t3k...@gmail.com> wrote:
You should set min_df=1 and max_df=1.0 (which should be
the default, but it depends on your scikit-learn version).
How did you determine that your vocabulary size should be
1860?
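A small sketch of what these two parameters do, on a made-up corpus: with min_df=1 and max_df=1.0 no term is removed by document frequency, while a higher min_df cuts rare terms. The min_df=2 run is only there for contrast and is not something suggested in this thread.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus for illustration.
corpus = ["apple banana", "apple cherry", "apple banana date"]

# No document-frequency filtering: every tokenized term is kept.
tf_all = TfidfVectorizer(min_df=1, max_df=1.0, use_idf=False)
tf_all.fit(corpus)

# For contrast: min_df=2 drops terms found in fewer than two documents.
tf_cut = TfidfVectorizer(min_df=2, max_df=1.0, use_idf=False)
tf_cut.fit(corpus)

n_all = len(tf_all.vocabulary_)
n_cut = len(tf_cut.vocabulary_)
print(n_all, n_cut)
```

If len(tf.vocabulary_) is still too small with min_df=1 and max_df=1.0, the frequency filters are not the cause and tokenization is the next thing to look at.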
On 11/19/2015 12:31 PM, Ehsan Asgari wrote:
Hi,
Thank you for your reply. I changed my delimiters from
tab to space and most of the problem has been solved
(1900 index terms out of 1914). However, there are still a few
words that are excluded. I didn't set any parameters, as
you can see in the code.
    tf = TfidfVectorizer(ngram_range=(1, ngram), use_idf=False)
    tf_matrix = tf.fit_transform(corpus)
    feature_names = tf.get_feature_names()
Should I play with the min_df and max_df?
Best,
Ehsan
On Nov 19, 2015, at 9:01 AM, Chris Holdgraf <choldg...@berkeley.edu> wrote:
If your vocab is indeed being cut down, could it be
because some words don't pass through the word frequency
cutoff filters (min_df, max_df)?
On Thu, Nov 19, 2015 at 8:55 AM, Andreas Mueller <t3k...@gmail.com> wrote:
Hi Ehsan.
Which version of scikit-learn are you using?
And why do you think the vocabulary size is 1860?
What is len(tf.vocabulary_)?
Andy
On 11/18/2015 11:45 PM, Ehsan Asgari wrote:
Hi,
I am using TfidfVectorizer from
sklearn.feature_extraction.text to generate the
tf-idf matrix of a corpus. However, when I look at
the features extracted from my corpus, it seems that
it has reduced my vocabulary size from 1860 to 598!
I tried to play with max_df, min_df, and
max_features, but nothing changed.
    tf = TfidfVectorizer(ngram_range=(1, ngram), use_idf=False)
    tf_matrix = tf.fit_transform(corpus)
    feature_names = tf.get_feature_names()
Does someone have an idea how to solve this problem?
Thank you,
Ehsan
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
<mailto:Scikit-learn-general@lists.sourceforge.net>
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
--
_____________________________________
PhD Candidate in Neuroscience | UC Berkeley
<http://hwni.org/>
Editor and Web Director | Berkeley Science Review
<http://sciencereview.berkeley.edu/>
_____________________________________