Hi,
I recently used ``SGDClassifier``** in a setting (text classification)
where test-time performance is critical and classification decisions
are made for a single data point at a time. After doing some profiling
I was quite surprised to find that the performance bottleneck was
_not_ feature processing (i.e. token filtering and hash
vectorization) but numpy/scipy - concretely, the dot product of
a sparse vector (a scipy CSR matrix) and a dense parameter matrix.
Consider the following example::
import numpy as np
from scipy import sparse as sp
from sklearn.utils.extmath import safe_sparse_dot
# this is the way SGDClassifier lays out its parameter vectors
# for 20 classes and 2**18 features
W = np.random.rand(20, 2**18)
x = np.zeros((2**18,), dtype=np.float64)
x[np.random.randint(0, 2**18, size=7)] = 1.0
x = sp.csr_matrix(x)
x.sort_indices()
%timeit safe_sparse_dot(x, W.T)
this gives the following striking results on my laptop:
10 loops, best of 3: 74 ms per loop <--- 74 ms - that's a lot!
Now let's try a different layout of W::
W = np.random.rand(2**18, 20)
%timeit safe_sparse_dot(x, W)
which gives:
10000 loops, best of 3: 62.8 us per loop <-- 62 microseconds - if
I'm not mistaken, that's a three orders of magnitude speedup! How is
that possible?
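One way to see why the two layouts behave so differently is to look
at how the dense operand sits in memory: ``W.T`` is not a new array
but a strided view, so consecutive elements along its rows are
2**18 * 8 bytes apart. A quick check (the strides assume float64 and
the shapes above)::

import numpy as np

W = np.random.rand(20, 2**18)
print(W.flags['C_CONTIGUOUS'])    # True  - rows of W are contiguous
print(W.T.flags['C_CONTIGUOUS'])  # False - W.T is only a strided view
print(W.strides)                  # (2097152, 8)
print(W.T.strides)                # (8, 2097152)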
The take home message here is that choosing a memory layout that is
convenient at training time might bite us at test time...
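If we want to keep the (n_classes, n_features) layout for training,
one workaround at test time would be to materialize a contiguous copy
of the transposed coefficients once and reuse it for every
single-point decision - a minimal sketch (``W_pred`` is just an
illustrative name, nothing in sklearn)::

import numpy as np
from scipy import sparse as sp
from sklearn.utils.extmath import safe_sparse_dot

W = np.random.rand(20, 2**18)       # training-time layout
W_pred = np.ascontiguousarray(W.T)  # one-off copy, shape (2**18, 20)

x = np.zeros((2**18,), dtype=np.float64)
x[np.random.randint(0, 2**18, size=7)] = 1.0
x = sp.csr_matrix(x)

scores = safe_sparse_dot(x, W_pred)  # fast path, no copy per call

This trades one extra copy of the parameters in memory for paying the
~70 ms copy once instead of on every prediction.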
It's also quite interesting if we consider more than one data point::
X = np.zeros((1000, 2**18))
for i in range(1000):
    X[i, np.random.randint(0, 2**18, size=7)] = 1.0
X = sp.csr_matrix(X)
W = np.random.rand(20, 2**18)
%timeit safe_sparse_dot(X, W.T)
which gives roughly the same result as for a single data point: 10
loops, best of 3: 68.9 ms per loop. That's also the reason we haven't
spotted the issue in our benchmarks, which make batch decisions.
It seems to me that dot products between a sparse matrix and a dense
array view trigger a memory copy of the view (the copy alone accounts
for roughly 70 ms: ``%timeit W.T.copy()``), so we should be aware of
such cases in our code.
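If the copy-on-view hypothesis is right, a cheap guard in our code
would be to check the flags of the dense operand before a hot loop -
just a sketch, ``_warn_if_view`` is a hypothetical helper, not an
existing sklearn function::

import warnings

def _warn_if_view(W):
    # hypothetical guard: a non-contiguous view handed to a
    # sparse * dense dot seems to get copied on every call
    if not W.flags['C_CONTIGUOUS']:
        warnings.warn("W is a non-contiguous view; consider passing "
                      "np.ascontiguousarray(W) to avoid per-call copies")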
best,
Peter
** this is not limited to SGDClassifier but also applies to the naive
Bayes estimators and others that use ``safe_sparse_dot(X, coef.T)``.
--
Peter Prettenhofer