> I don't see the number of non-zeros: could you please do:
> 
> >>> print vectorizer.transform([my_text_document])
> 
> as I asked previously? The run time should be linear with the number
> of non-zeros.
--------------------------------------------

ipdb> print self.vectorizer.transform([doc])
  (0, 687)      0.0303117660218
  (0, 1145)     0.0636126446646
  (0, 1146)     0.0303117660218
  (0, 2471)     0.0303117660218
  (0, 4454)     0.0303117660218
  (0, 4468)     0.0513222811776
  (0, 4504)     0.0846231598204
  (0, 4505)     0.0846231598204
  (0, 4556)     0.0303117660218
  (0, 4565)     0.0303117660218
  (0, 5256)     0.0513222811776
  (0, 5257)     0.0513222811776
  (0, 6183)     0.0636126446646
  (0, 6184)     0.0303117660218
  (0, 6187)     0.0303117660218
  (0, 8034)     0.0513222811776
  (0, 9425)     0.0303117660218
  (0, 9443)     0.0303117660218
  (0, 10363)    0.0303117660218
  (0, 10368)    0.0513222811776
  (0, 10569)    0.0303117660218
  (0, 10635)    0.0513222811776
  (0, 10644)    0.0303117660218
  (0, 11971)    0.0723327963334
  (0, 11975)    0.0636126446646
  :     :
  (0, 185670)   0.0303117660218
  (0, 186664)   0.0303117660218
  (0, 187206)   0.0636126446646
  (0, 187233)   0.0303117660218
  (0, 188991)   0.0303117660218
  (0, 189088)   0.0303117660218
  (0, 191192)   0.0513222811776
  (0, 191907)   0.0513222811776
  (0, 192429)   0.0303117660218
  (0, 192431)   0.0303117660218
  (0, 192524)   0.0636126446646
  (0, 192549)   0.0513222811776
  (0, 193044)   0.0303117660218
  (0, 193225)   0.0723327963334
  (0, 193239)   0.0790966714502
  (0, 193240)   0.0790966714502
  (0, 194837)   0.0303117660218
  (0, 195783)   0.0303117660218
  (0, 198535)   0.0303117660218
  (0, 198889)   0.0790966714502
  (0, 199159)   0.0303117660218
  (0, 199189)   0.0303117660218
  (0, 199195)   0.0303117660218
  (0, 199310)   0.0303117660218
  (0, 199311)   0.0303117660218
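Rather than eyeballing the printed entries above, the non-zero count can be read directly from the `.nnz` attribute of the scipy sparse matrix that `transform` returns. A minimal sketch, using a tiny stand-in corpus instead of the real documents:

```python
# Count stored (non-zero) entries of a transformed document via .nnz
# instead of printing every (row, col) value pair.
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny stand-in corpus; the real case uses the fitted vectorizer and doc.
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "never jump over the lazy dog quickly",
]
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)

X = vectorizer.transform([corpus[0]])
print(X.nnz)  # one stored entry per distinct term in the document
```

Here the first document has 8 distinct terms, so `X.nnz` is 8; on the document above the same attribute gives the non-zero count that the runtime should scale with.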


> For reference, on my machine I have the following timing:
> 
> In [5]: from sklearn.datasets import fetch_20newsgroups
> 
> In [6]: from sklearn.feature_extraction.text import CountVectorizer
> 
> In [7]: twenty = fetch_20newsgroups()
> 
> In [8]: %time X = CountVectorizer().fit_transform(twenty.data)
> CPU times: user 12.14 s, sys: 0.66 s, total: 12.80 s
> Wall time: 13.12 s
> 
> In [9]: X
> Out[9]:
> <11314x56436 sparse matrix of type '<type 'numpy.int64'>'
>       with 1713894 stored elements in COOrdinate format>
> 
On my machine:

In [1]: from sklearn.datasets import fetch_20newsgroups

In [2]: from sklearn.feature_extraction.text import CountVectorizer

In [3]: twenty = fetch_20newsgroups()

In [4]: %time X = CountVectorizer().fit_transform(twenty.data)
CPU times: user 10.68 s, sys: 0.14 s, total: 10.82 s
Wall time: 10.82 s

In [5]: X
Out[5]: 
<11314x56431 sparse matrix of type '<type 'numpy.int64'>'
        with 1713896 stored elements in COOrdinate format>
-----------------------------------------------------------------------
In [6]: from sklearn.feature_extraction.text import TfidfVectorizer

In [7]: %time X = TfidfVectorizer().fit_transform(twenty.data)
CPU times: user 11.58 s, sys: 0.01 s, total: 11.60 s
Wall time: 11.61 s

In [8]: X
Out[8]: 
<11314x56431 sparse matrix of type '<type 'numpy.float64'>'
        with 1713896 stored elements in Compressed Sparse Row format>
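The near-identical timings make sense: TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer, so the tf-idf run only adds the (cheap) reweighting pass on top of tokenization and counting. A small sketch of that equivalence, on a stand-in corpus rather than 20newsgroups:

```python
# TfidfVectorizer == CountVectorizer + TfidfTransformer (default params):
# same sparsity pattern, same values up to float rounding.
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Stand-in corpus; the thread above uses twenty.data instead.
corpus = [
    "sparse matrices store only the non zero entries",
    "transform time is linear in the number of stored entries",
]

X_counts = CountVectorizer().fit_transform(corpus)
X_two_step = TfidfTransformer().fit_transform(X_counts)
X_one_step = TfidfVectorizer().fit_transform(corpus)

assert X_two_step.nnz == X_one_step.nnz
assert np.allclose(X_two_step.toarray(), X_one_step.toarray())
```

Both matrices have the same number of stored elements, which matches the logs above: the int64 count matrix and the float64 tf-idf matrix report the same 1713896 non-zeros.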



_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
