I've noticed that on large datasets SVC takes several minutes to
classify the data, when I would expect it to take under a second.
To debug this, I wrote a small test script that times each step.
I found that predict with the linear kernel is not doing at all what I
thought it would, which is return (X*coef_ + intercept_) > 0.
So this raises a second question: what on earth is it doing?
But getting back to the first question, the runtime: whatever it is
doing is far slower than (X*coef_ + intercept_) > 0.
With the following script, predict takes 1.1 seconds to classify 1,000
examples. Evaluating the dot product and thresholding takes only 3e-3
seconds, but agrees with the actual predict output only 50% of the
time.


import time

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState([1, 2, 3])

# Synthetic, linearly separable problem: labels come from a random hyperplane.
X = rng.randn(1000, 1000)
w = rng.randn(1000)
b = rng.randn(1)
y = (np.dot(X, w) + b) > 0

# Time the fit.
t1 = time.time()
svm = SVC(kernel='linear', C=1.0).fit(X, y)
t2 = time.time()
print('train time ', t2 - t1)

X2 = X  # alternatively: rng.randn(1000, 1000)

# Time the library's predict.
t1 = time.time()
y1 = svm.predict(X2)
t2 = time.time()
print('predict time ', t2 - t1)

# Time the hand-written rule: dot product with the learned weights, then threshold.
t1 = time.time()
y2 = (np.dot(X2, svm.coef_.T) + svm.intercept_) > 0
t2 = time.time()
print('dot product time', t2 - t1)

print('predict accuracy ', (y1 == y).mean())
print('dot product accuracy ', (y2 == y).mean())
print('predict and dot agreement rate', (y1 == y2).mean())
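
One thing that may be worth checking about the 50% agreement figure: assuming svm.coef_ has shape (1, n_features) for a two-class problem, the hand-computed y2 above has shape (1000, 1) while y and y1 are 1-D, so == broadcasts to a 1000 x 1000 matrix instead of comparing element by element. The sketch below reproduces that effect with NumPy alone. Here coef and intercept are hypothetical stand-ins with the assumed shapes, not values from a fitted SVC, and the sizes are reduced to keep it quick.

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, n_features = 200, 50

X = rng.randn(n_samples, n_features)
coef = rng.randn(1, n_features)   # hypothetical stand-in for svm.coef_ (assumed shape)
intercept = rng.randn(1)          # hypothetical stand-in for svm.intercept_

# Column-vector result, as produced by np.dot(X, coef.T) + intercept.
y2_col = (np.dot(X, coef.T) + intercept) > 0   # shape (n_samples, 1)

# 1-D labels carrying exactly the same decisions.
y = y2_col.ravel()                             # shape (n_samples,)

# Comparing (n, 1) with (n,) broadcasts to an (n, n) matrix of all pairs.
naive = (y2_col == y)

# Flattening first gives the intended element-wise comparison.
fixed = (y2_col.ravel() == y)

print('naive comparison shape', naive.shape, 'mean', naive.mean())
print('fixed comparison shape', fixed.shape, 'mean', fixed.mean())
```

If this is what is happening in the script, flattening y2 with ravel() before the comparisons should give a meaningful agreement rate. It would not by itself explain the difference in runtime.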
