2012/10/22 Ark <[email protected]>:
> ...see if it's related to IDF normalization?
>>
>> How many dimensions do you have in your fitted model?
>>
>> >>> print len(vectorizer.vocabulary_)
>>
>> How many documents do you have in your training corpus?
>>
>> How many non-zeros do you have in your transformed document?
>>
>> >>> print vectorizer.transform([my_text_document])
>
>
> In [30]: print vectorizer.transform([input_txt]).data.shape
> (110,)

You have just 110 distinct terms in a 50kB email document? This is
weird. Maybe there is something special in the formatting of your
data that makes one element of the vectorizer run unusually slow.
It's hard to tell without having access to a dataset that exhibits
this performance issue.
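
To narrow down where the time goes, here is a minimal timing sketch
(reusing the `vectorizer` and `input_txt` names from your snippet;
adjust to your actual variables). It times the analyzer step
(preprocessing + tokenization + n-gram extraction) separately from
the full transform:

>>> import time
>>> analyzer = vectorizer.build_analyzer()
>>> t0 = time.time()
>>> tokens = analyzer(input_txt)  # text processing only
>>> print time.time() - t0, len(tokens)
>>> t0 = time.time()
>>> _ = vectorizer.transform([input_txt])  # full pipeline
>>> print time.time() - t0

If the analyzer call dominates, the bottleneck is in the text
processing (e.g. the token regexp on unusual input) rather than in
the sparse matrix construction or the IDF normalization.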

Try having a manual look at the content of the 20 newsgroups corpus:

>>> from sklearn.datasets import fetch_20newsgroups
>>> twenty = fetch_20newsgroups()
>>> print twenty.data[0]
>>> print twenty.data[1]
...

And see if the text of the twenty newsgroups documents, which are
fast to vectorize, is very different from the content of your
documents that are slow to vectorize.
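
As a rough comparison (a sketch under the same naming assumptions as
above), you can time vectorizing a batch of twenty newsgroups
documents against a batch built from your document:

>>> import time
>>> t0 = time.time()
>>> _ = vectorizer.transform(twenty.data[:100])  # reference corpus
>>> print time.time() - t0
>>> t0 = time.time()
>>> _ = vectorizer.transform([input_txt] * 100)  # your document
>>> print time.time() - t0

If the second call is much slower per document, the slowdown is
specific to the content or formatting of your data.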

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
