Olivier, thank you for your feedback, and thanks also to Kyle and Mathieu for their answers. My dataset of ~25 million samples is quite balanced (45-55% positive). I randomly subsampled it to 100,000 samples for the sklearn.svm.libsvm.fit step. I ran a CPU profiler, and the time-consuming step is sklearn.svm.libsvm.predict. In a week or two I will have the data that lets me compare the sensitivity of various approaches. But I can already tell that my current SVM approach filters out too many true positives. This is quite possibly because I provide the fitting step with a faulty training set containing too many positives.
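A minimal sketch of the Nystroem + linear model route suggested in the quoted message below, combined with the random subsampling described above. Everything specific in it is an assumption made for illustration: the synthetic data from make_classification standing in for the real samples, the scaled-down sizes, the gamma and n_components values, and the choice of SGDClassifier with hinge loss as the linear model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the real data (assumption: sizes are scaled down).
X, y = make_classification(n_samples=20000, n_features=20, random_state=0)

# Random subsample used for fitting only.
rng = np.random.RandomState(0)
train_idx = rng.choice(X.shape[0], size=5000, replace=False)

# Nystroem builds an approximate RBF feature map from n_components
# landmark samples; a hinge-loss linear model is then trained on it.
model = make_pipeline(
    Nystroem(kernel="rbf", gamma=0.05, n_components=200, random_state=0),
    SGDClassifier(loss="hinge", random_state=0),
)
model.fit(X[train_idx], y[train_idx])

# Predict in chunks so the full data set is never transformed at once.
chunk = 5000
pred = np.concatenate(
    [model.predict(X[i:i + chunk]) for i in range(0, X.shape[0], chunk)]
)
accuracy = (pred == y).mean()
```

With this setup the per-sample prediction cost depends on n_components rather than on the number of support vectors, which is the step the profiler flagged. Whether the sensitivity is acceptable would still have to be checked on the real data.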
Tommy

-----Original Message-----
From: Olivier Grisel [mailto:olivier.gri...@ensta.org]
Sent: 24 February 2014 09:06
To: scikit-learn-general
Cc: Deepti Gurdasani
Subject: Re: [Scikit-learn-general] SVM, appropriate size of training set

Is your dataset balanced (roughly as many positive as negative)?

Kernel SVMs as implemented in scikit-learn do not scale with the number of samples: the computational cost is more than quadratic wrt n_samples.

Either subsample (especially if you have a large imbalance), use an approximation such as Nystroem [1] feature expansion + linear model, or use a more scalable non-linear algorithm such as RandomForestClassifier.

[1] http://scikit-learn.org/stable/modules/kernel_approximation.html

--
Olivier

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.