> I don't see the number of non-zeros: could you please do:
>
> >>> print vectorizer.transform([my_text_document])
>
> as I asked previously? The run time should be linear in the number
> of non-zeros.
--------------------------------------------
ipdb> print self.vectorizer.transform([doc])
(0, 687) 0.0303117660218
(0, 1145) 0.0636126446646
(0, 1146) 0.0303117660218
(0, 2471) 0.0303117660218
(0, 4454) 0.0303117660218
(0, 4468) 0.0513222811776
(0, 4504) 0.0846231598204
(0, 4505) 0.0846231598204
(0, 4556) 0.0303117660218
(0, 4565) 0.0303117660218
(0, 5256) 0.0513222811776
(0, 5257) 0.0513222811776
(0, 6183) 0.0636126446646
(0, 6184) 0.0303117660218
(0, 6187) 0.0303117660218
(0, 8034) 0.0513222811776
(0, 9425) 0.0303117660218
(0, 9443) 0.0303117660218
(0, 10363) 0.0303117660218
(0, 10368) 0.0513222811776
(0, 10569) 0.0303117660218
(0, 10635) 0.0513222811776
(0, 10644) 0.0303117660218
(0, 11971) 0.0723327963334
(0, 11975) 0.0636126446646
: :
(0, 185670) 0.0303117660218
(0, 186664) 0.0303117660218
(0, 187206) 0.0636126446646
(0, 187233) 0.0303117660218
(0, 188991) 0.0303117660218
(0, 189088) 0.0303117660218
(0, 191192) 0.0513222811776
(0, 191907) 0.0513222811776
(0, 192429) 0.0303117660218
(0, 192431) 0.0303117660218
(0, 192524) 0.0636126446646
(0, 192549) 0.0513222811776
(0, 193044) 0.0303117660218
(0, 193225) 0.0723327963334
(0, 193239) 0.0790966714502
(0, 193240) 0.0790966714502
(0, 194837) 0.0303117660218
(0, 195783) 0.0303117660218
(0, 198535) 0.0303117660218
(0, 198889) 0.0790966714502
(0, 199159) 0.0303117660218
(0, 199189) 0.0303117660218
(0, 199195) 0.0303117660218
(0, 199310) 0.0303117660218
(0, 199311) 0.0303117660218
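(As an aside, rather than printing every stored entry, the non-zero count can be read directly from the sparse matrix's `.nnz` attribute. A minimal sketch on a toy corpus — the documents and query below are made up for illustration, not the 20newsgroups data:)

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny illustrative corpus (hypothetical, stands in for the real data).
docs = ["the quick brown fox", "the lazy dog", "the quick dog"]

vectorizer = TfidfVectorizer()
vectorizer.fit(docs)

# transform() returns a 1-row sparse matrix; .nnz is the number of
# stored (non-zero) entries, i.e. the quantity the runtime scales with.
row = vectorizer.transform(["the quick brown fox jumps"])
print(row.nnz)  # -> 4: "jumps" is out-of-vocabulary and dropped
```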
> For reference, on my machine I have the following timing:
>
> In [5]: from sklearn.datasets import fetch_20newsgroups
>
> In [6]: from sklearn.feature_extraction.text import CountVectorizer
>
> In [7]: twenty = fetch_20newsgroups()
>
> In [8]: %time X = CountVectorizer().fit_transform(twenty.data)
> CPU times: user 12.14 s, sys: 0.66 s, total: 12.80 s
> Wall time: 13.12 s
>
> In [9]: X
> Out[9]:
> <11314x56436 sparse matrix of type '<type 'numpy.int64'>'
> with 1713894 stored elements in COOrdinate format>
>
On my machine:
In [1]: from sklearn.datasets import fetch_20newsgroups
In [2]: from sklearn.feature_extraction.text import CountVectorizer
In [3]: twenty = fetch_20newsgroups()
In [4]: %time X = CountVectorizer().fit_transform(twenty.data)
CPU times: user 10.68 s, sys: 0.14 s, total: 10.82 s
Wall time: 10.82 s
In [5]: X
Out[5]:
<11314x56431 sparse matrix of type '<type 'numpy.int64'>'
with 1713896 stored elements in COOrdinate format>
-----------------------------------------------------------------------
In [6]: %time X = TfidfVectorizer().fit_transform(twenty.data)
CPU times: user 11.58 s, sys: 0.01 s, total: 11.60 s
Wall time: 11.61 s
In [7]: X
Out[7]:
<11314x56431 sparse matrix of type '<type 'numpy.float64'>'
with 1713896 stored elements in Compressed Sparse Row format>
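(For context on why the two timings are so close: TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer, so both produce matrices with the same sparsity pattern, and the tf-idf reweighting only touches the stored entries. A minimal sketch on a toy corpus — the documents below are made up for illustration:)

```python
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Tiny illustrative corpus (hypothetical, stands in for the real data).
docs = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Two-step pipeline: raw counts, then tf-idf reweighting.
counts = CountVectorizer().fit_transform(docs)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# One-step equivalent.
tfidf_one_step = TfidfVectorizer().fit_transform(docs)

# Same sparsity pattern throughout: reweighting never adds or removes
# stored entries, so the extra cost over counting is small.
print(counts.nnz, tfidf_two_step.nnz, tfidf_one_step.nnz)
```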
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general