I think I've figured out what the problem is, but someone familiar
with the code should confirm.
I think SVC always evaluates its decision function via the support
vector expansion, even though for a linear kernel it would be
computationally cheaper to do a single dot product with the weight
vector in feature space.

I determined this using the script below. If I change m, the number of
training samples, then the prediction time changes, but the ratio of
prediction time to number of support vectors remains more or less
constant.
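
In case it helps confirm the diagnosis: for a linear kernel the
support-vector (dual) expansion and the single dot product should give
identical decision values, since coef_ is (as far as I can tell) just
dual_coef_ times support_vectors_. A quick self-contained check of
that equivalence (my own sketch, not taken from the library internals):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(200, 20)
y = (X[:, 0] > 0).astype(int)

svm = SVC(kernel='linear', C=1.0).fit(X, y)

# Dual form: weighted sum of inner products with the support vectors.
# dual_coef_ has shape (1, n_SV), support_vectors_ has shape (n_SV, n).
f_dual = np.dot(np.dot(svm.dual_coef_, svm.support_vectors_), X.T) + svm.intercept_

# Primal form: one dot product with the precomputed weight vector.
f_primal = np.dot(X, svm.coef_.T).ravel() + svm.intercept_

print(np.allclose(f_dual.ravel(), f_primal))          # the two forms agree
print(np.allclose(f_primal, svm.decision_function(X)))  # and match predict's scores
```

So if the values agree, the slowdown is presumably just that predict
evaluates the dual form, looping over support vectors, instead of
collapsing it to coef_ once.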




import numpy as np
from sklearn.svm import SVC
import time

rng = np.random.RandomState([1, 2, 3])

m = 1000  # number of training samples
n = 1000  # number of features

X = rng.randn(m, n)
w = rng.randn(n)
b = rng.randn(1)
y = (np.dot(X, w) + b) > 0

t1 = time.time()
svm = SVC(kernel='linear', C=1.0).fit(X, y)
t2 = time.time()
print('train time', t2 - t1)

X2 = X

t1 = time.time()
y1 = svm.predict(X2)
t2 = time.time()
print('predict time', t2 - t1)
print('# support vectors:', svm.n_support_)
print('predict time per support vector:', (t2 - t1) / float(svm.n_support_.sum()))

t1 = time.time()
# ravel() so y2 has shape (m,); without it, np.dot(X2, svm.coef_.T)
# has shape (m, 1), the comparisons below broadcast to an (m, m)
# matrix, and the agreement rate comes out near 50%
y2 = (np.dot(X2, svm.coef_.T) + svm.intercept_).ravel() > 0
t2 = time.time()
print('dot product time', t2 - t1)

print('predict accuracy', (y1 == y).mean())
print('dot product accuracy', (y2 == y).mean())
print('predict and dot agreement rate', (y1 == y2).mean())
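
As a practical workaround while this gets sorted out, LinearSVC
(the liblinear backend) stores the model in primal form, so its
predict should be one matrix product regardless of how many support
vectors the equivalent SVC would have. A rough timing sketch (my own
setup, numbers will vary by machine):

```python
import numpy as np
from sklearn.svm import LinearSVC
import time

rng = np.random.RandomState(0)
X = rng.randn(1000, 1000)
w = rng.randn(1000)
y = (np.dot(X, w) > 0).astype(int)

clf = LinearSVC(C=1.0).fit(X, y)

t1 = time.time()
y_pred = clf.predict(X)  # one (m, n) x (n,) product, no support-vector loop
t2 = time.time()
print('LinearSVC predict time', t2 - t1)
print('LinearSVC train accuracy', (y_pred == y).mean())
```

Note LinearSVC optimizes a slightly different objective (squared
hinge by default, and it penalizes the intercept), so it is not a
drop-in replacement for SVC(kernel='linear'), but for a problem like
this the predictions should be close.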



On Thu, May 24, 2012 at 10:09 AM, Ian Goodfellow
<[email protected]> wrote:
> I've noticed that on large datasets, it takes several minutes for SVC
> to classify the dataset, when it should take under a second.
> To debug this, I made a small test script where I time each step.
> I found that the predict function for the linear kernel, is not doing
> at all what I thought it would, which is return (X*coef_ + intercept_)
>>0.
> So, this raises a second question, what on earth is it doing?
> But getting back to the first question, of the runtime, whatever it is
> doing is way slower than (X*coef_ + intercept_) > 0.
> Using the following script, I get that whatever the SVC is doing for
> predict, it takes 1.1 seconds to classify 1,000 examples. Evaluating
> the dot product and thresholding takes only 3e-3 seconds, but only
> agrees with whatever the actual prediction function is 50% of the
> time.
>
>
> import numpy as np
> from sklearn.svm import SVC
> import time
>
> rng = np.random.RandomState([1,2,3])
>
> X = rng.randn(1000,1000)
> w = rng.randn(1000)
> b = rng.randn(1)
> y = (np.dot(X,w) + b ) > 0
>
> t1 = time.time()
> svm = SVC(kernel = 'linear', C = 1.0).fit(X,y)
> t2 = time.time()
> print 'train time ',t2 - t1
>
> X2 = X#rng.randn(1000,1000)
>
> t1 = time.time()
> y1 = svm.predict(X2)
> t2 = time.time()
> print 'predict time ',t2 - t1
>
> t1 = time.time()
> y2 = ( np.dot(X2, svm.coef_.T) + svm.intercept_ ) > 0
> t2 = time.time()
> print 'dot product time',t2 -t1
>
> print 'predict accuracy ',(y1 == y).mean()
> print 'dot product accuracy ',(y2 == y).mean()
> print 'predict and dot agreement rate',(y1 == y2).mean()

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
