I think I've figured out what the problem is, but someone familiar with the code should confirm: SVC always uses a decision function based on the support vectors, even though for a linear kernel it is computationally cheaper to do a single dot product in feature space.
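For what it's worth, here is a small sketch (against a current scikit-learn, on a made-up dataset) of why the two should be equivalent for a linear kernel: the kernel expansion over the support vectors collapses to a single weight vector, which SVC exposes as coef_:

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(200, 20)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

svm = SVC(kernel='linear', C=1.0).fit(X, y)

# With a linear kernel, sum_i dual_coef_[i] * <x_i, x> over the support
# vectors x_i collapses to <w, x>, where w = dual_coef_ @ support_vectors_.
# SVC stores exactly this as coef_.
w = np.dot(svm.dual_coef_, svm.support_vectors_)
print(np.allclose(w, svm.coef_))  # True

# So the decision values reduce to one dot product plus the intercept:
scores = np.dot(X, svm.coef_.T) + svm.intercept_
print(np.allclose(scores.ravel(), svm.decision_function(X)))  # True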
I determined this using the script below. If I change m, the number of training samples, the prediction time changes, but the ratio of prediction time to number of support vectors stays more or less constant.

import numpy as np
from sklearn.svm import SVC
import time

rng = np.random.RandomState([1, 2, 3])

m = 1000
n = 1000

X = rng.randn(m, n)
w = rng.randn(n)
b = rng.randn(1)
y = (np.dot(X, w) + b) > 0

t1 = time.time()
svm = SVC(kernel='linear', C=1.0).fit(X, y)
t2 = time.time()
print 'train time ', t2 - t1

X2 = X

t1 = time.time()
y1 = svm.predict(X2)
t2 = time.time()
print 'predict time ', t2 - t1
print '# support vectors:', svm.n_support_
print 'predict time per support vector:', (t2 - t1) / float(svm.n_support_.sum())

t1 = time.time()
y2 = (np.dot(X2, svm.coef_.T) + svm.intercept_) > 0
t2 = time.time()
print 'dot product time', t2 - t1

print 'predict accuracy ', (y1 == y).mean()
print 'dot product accuracy ', (y2 == y).mean()
print 'predict and dot agreement rate', (y1 == y2).mean()

On Thu, May 24, 2012 at 10:09 AM, Ian Goodfellow <[email protected]> wrote:
> I've noticed that on large datasets, it takes several minutes for SVC
> to classify the dataset, when it should take under a second.
> To debug this, I made a small test script where I time each step.
> I found that the predict function for the linear kernel is not doing
> at all what I thought it would, which is return (X*coef_ + intercept_) > 0.
> So, this raises a second question: what on earth is it doing?
> But getting back to the first question, of the runtime: whatever it is
> doing is way slower than (X*coef_ + intercept_) > 0.
> Using the following script, I get that whatever SVC is doing for
> predict, it takes 1.1 seconds to classify 1,000 examples. Evaluating
> the dot product and thresholding takes only 3e-3 seconds, but only
> agrees with the actual prediction function 50% of the time.
>
> import numpy as np
> from sklearn.svm import SVC
> import time
>
> rng = np.random.RandomState([1,2,3])
>
> X = rng.randn(1000,1000)
> w = rng.randn(1000)
> b = rng.randn(1)
> y = (np.dot(X,w) + b ) > 0
>
> t1 = time.time()
> svm = SVC(kernel = 'linear', C = 1.0).fit(X,y)
> t2 = time.time()
> print 'train time ',t2 - t1
>
> X2 = X#rng.randn(1000,1000)
>
> t1 = time.time()
> y1 = svm.predict(X2)
> t2 = time.time()
> print 'predict time ',t2 - t1
>
> t1 = time.time()
> y2 = ( np.dot(X2, svm.coef_.T) + svm.intercept_ ) > 0
> t2 = time.time()
> print 'dot product time',t2 -t1
>
> print 'predict accuracy ',(y1 == y).mean()
> print 'dot product accuracy ',(y2 == y).mean()
> print 'predict and dot agreement rate',(y1 == y2).mean()
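One possible workaround, sketched under the assumption that only the linear decision rule matters here: LinearSVC (the liblinear wrapper) stores the weight vector directly rather than a support-vector expansion, so its predict is one dot product plus a threshold regardless of how many support vectors SVC would have kept. The dataset below is made up for illustration.

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.randn(500, 50)
y = (np.dot(X, rng.randn(50)) > 0).astype(int)

clf = LinearSVC(C=1.0).fit(X, y)

# LinearSVC keeps only coef_ and intercept_, so prediction is a single
# dot product and a sign test, independent of the training set size.
manual = (np.dot(X, clf.coef_.T) + clf.intercept_ > 0).ravel().astype(int)
print((manual == clf.predict(X)).all())  # True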
