Hi,

I recently used ``SGDClassifier``** in a setting (text classification)
where test-time performance is critical and classification decisions
are made for a single data point at a time. After doing some profiling
I was quite surprised to find that the performance bottleneck was
_not_ feature processing (i.e. token filtering and hash
vectorization) but numpy/scipy - concretely, the dot product of
a sparse vector (a scipy CSR matrix) and a dense parameter matrix.
Consider the following example::

  import numpy as np
  from scipy import sparse as sp
  from sklearn.utils.extmath import safe_sparse_dot

  # this is how SGDClassifier lays out its parameter vectors
  # for 20 classes and 2**18 features
  W = np.random.rand(20, 2**18)
  x = np.zeros((2**18,), dtype=np.float64)
  x[np.random.randint(0, 2**18, size=7)] = 1.0
  x = sp.csr_matrix(x)
  x.sort_indices()

  %timeit safe_sparse_dot(x, W.T)

This gives the following striking result on my laptop::

  10 loops, best of 3: 74 ms per loop   <--- 74 ms - that's a lot!

Now let's try a different layout of W::

  W = np.random.rand(2**18, 20)

  %timeit safe_sparse_dot(x, W)

which gives::

  10000 loops, best of 3: 62.8 us per loop  <-- 62.8 microseconds

If I'm not mistaken, that's a three-orders-of-magnitude speedup! How
is that possible?

The take-home message here is that choosing a memory layout that is
convenient at training time might bite us at test time...
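
One possible mitigation (just a sketch - ``W_pred`` is my own name,
and this reuses ``x`` and the slow-layout ``W`` from the first
example): pay the transpose copy once up front instead of on every
prediction call::

  # hypothetical workaround: make a C-contiguous copy of the
  # transposed parameters once, then predict against that
  W_pred = np.ascontiguousarray(W.T)   # one-time cost, roughly the 70 ms copy

  %timeit safe_sparse_dot(x, W_pred)   # should be back in the microsecond range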

It's also quite interesting if we consider more than one data point::

  # build the sparse test matrix directly; going through a dense
  # 1000 x 2**18 array first would need ~2 GB
  X = sp.lil_matrix((1000, 2**18))
  for i in range(1000):
      X[i, np.random.randint(0, 2**18, size=7)] = 1.0
  X = X.tocsr()

  W = np.random.rand(20, 2**18)
  %timeit safe_sparse_dot(X, W.T)

which gives roughly the same result as for one data point (10 loops,
best of 3: 68.9 ms per loop, i.e. ~69 us per row) - which is why we
haven't spotted the issue in our benchmarks, which make batch
decisions: the copy cost is paid once per call, not once per row.
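
To make the contrast explicit, here is a sketch (reusing ``x``, ``X``
and the slow-layout ``W`` from above) of one batch call versus
simulated single-point decisions::

  # one batch call: the copy of W.T is paid once (~69 ms total)
  %timeit safe_sparse_dot(X, W.T)

  # 100 single-point decisions: the copy is paid on every call,
  # so expect something like 100 * 74 ms
  %timeit -n 1 -r 1 for _ in range(100): safe_sparse_dot(x, W.T)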

It seems to me that dot products between a sparse matrix and a dense
array view trigger a memory copy of the view (``%timeit W.T.copy()``
takes roughly the same 70 ms), so we should be aware of such cases in
our code.
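
For what it's worth, the view hypothesis is easy to check (a sketch,
with ``W`` in the slow ``(20, 2**18)`` layout)::

  # W.T is a transposed view: F-contiguous, not C-contiguous, so the
  # sparse * dense product apparently has to copy it into C order first
  print(W.T.flags['C_CONTIGUOUS'])   # False
  print(W.T.base is W)               # True - a view, no data of its own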

best,
 Peter

** This is not limited to ``SGDClassifier``; it also applies to
NaiveBayes and other estimators that use ``safe_sparse_dot(X, coef.T)``.

-- 
Peter Prettenhofer
