I've noticed that on large datasets SVC takes several minutes to classify the data, when it should take under a second. To debug this, I wrote a small test script that times each step. I found that predict for the linear kernel is not doing what I thought it would, which is return (X*coef_ + intercept_) > 0. That raises a second question: what is it actually doing?

Getting back to the first question, the runtime: whatever predict is doing is far slower than (X*coef_ + intercept_) > 0. With the following script, predict takes 1.1 seconds to classify 1,000 examples. Evaluating the dot product and thresholding takes only 3e-3 seconds, but it agrees with the actual predict output only 50% of the time.

import time

import numpy as np
from sklearn.svm import SVC

# Synthetic, linearly separable problem: 1000 examples, 1000 features.
rng = np.random.RandomState([1, 2, 3])
X = rng.randn(1000, 1000)
w = rng.randn(1000)
b = rng.randn(1)
y = (np.dot(X, w) + b) > 0

# Time training.
t1 = time.time()
svm = SVC(kernel='linear', C=1.0).fit(X, y)
t2 = time.time()
print('train time ', t2 - t1)

X2 = X  # rng.randn(1000, 1000)

# Time SVC.predict.
t1 = time.time()
y1 = svm.predict(X2)
t2 = time.time()
print('predict time ', t2 - t1)

# Time the explicit dot product and threshold.
t1 = time.time()
y2 = (np.dot(X2, svm.coef_.T) + svm.intercept_) > 0
t2 = time.time()
print('dot product time', t2 - t1)

print('predict accuracy ', (y1 == y).mean())
print('dot product accuracy ', (y2 == y).mean())
print('predict and dot agreement rate', (y1 == y2).mean())
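
One thing that may be worth ruling out for the 50% figure is array shapes rather than the classifier itself. For a two-class problem coef_ has shape (1, n_features), so np.dot(X2, svm.coef_.T) has shape (1000, 1), while y and the output of predict have shape (1000,). Comparing the two with == broadcasts to a 1000 x 1000 matrix instead of comparing element by element. The sketch below uses NumPy only and stand-in arrays; the names n, broadcast_cmp and elementwise_cmp are made up for illustration, and it has not been checked against the SVC run above.

```python
import numpy as np

rng = np.random.RandomState(0)
n = 1000

# Stand-in labels with the same shape as y in the script: (n,)
y = rng.randn(n) > 0

# Stand-in for the thresholded dot product. It is deliberately a perfect
# copy of y, but shaped (n, 1) like np.dot(X2, svm.coef_.T) + svm.intercept_.
y2 = y.reshape(-1, 1)

# (n, 1) == (n,) broadcasts to (n, n): every prediction vs. every label.
broadcast_cmp = (y2 == y)

# Flattening first gives the intended element-wise comparison.
elementwise_cmp = (y2.ravel() == y)

print('broadcast shape   ', broadcast_cmp.shape, 'mean', broadcast_cmp.mean())
print('elementwise shape ', elementwise_cmp.shape, 'mean', elementwise_cmp.mean())
```

Even though y2 here matches y exactly, the broadcast comparison averages to roughly 0.5 when the classes are about balanced, while the flattened comparison gives 1.0. If that is what is happening in the script, comparing y2.ravel() against y and y1 would show the real agreement rate.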
