Olivier, thank you for your feedback, and thanks to Kyle and Mathieu for
their answers. My dataset of ~25 million samples is quite balanced (45-55%
positive). I randomly subsampled to 100,000 samples for the
sklearn.svm.libsvm.fit step. I ran a CPU profiler, and the
time-consuming step is sklearn.svm.libsvm.predict. In a week or two I
will have the data that lets me compare the sensitivity of the various
approaches. But I can already tell that my current SVM approach filters
out too many true positives. This is quite possibly because the false
training set I provide to the fitting step contains too many positives.

Tommy

-----Original Message-----
From: Olivier Grisel [mailto:olivier.gri...@ensta.org] 
Sent: 24 February 2014 09:06
To: scikit-learn-general
Cc: Deepti Gurdasani
Subject: Re: [Scikit-learn-general] SVM, appropriate size of training
set

Is your dataset balanced (roughly as many positive as negative)?

Kernel SVMs as implemented in scikit-learn do not scale well with the
number of samples: the computational cost is more than quadratic wrt
n_samples. Either subsample (especially if you have a large imbalance),
use an approximation such as Nystroem [1] feature expansion + linear
model, or use a more scalable non-linear algorithm such as
RandomForestClassifier.

[1] http://scikit-learn.org/stable/modules/kernel_approximation.html
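
A minimal sketch of the Nystroem + linear model option, in case it helps.
The synthetic data, the gamma and n_components values, and the choice of
SGDClassifier with hinge loss as the linear model are all assumptions for
illustration, not tuned settings for your data:

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the real data (assumption: binary, roughly balanced).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, y_train = X[:1500], y[:1500]
X_test, y_test = X[1500:], y[1500:]

# Nystroem maps the inputs to an approximate RBF kernel feature space with
# n_components dimensions; a linear model is then trained on those features.
# gamma and n_components are placeholder values and would need tuning.
model = make_pipeline(
    Nystroem(kernel="rbf", gamma=0.05, n_components=200, random_state=0),
    SGDClassifier(loss="hinge", random_state=0),
)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print("held-out accuracy: %.3f" % accuracy)
```

For a fixed n_components, both the feature expansion and the linear
prediction cost grow linearly with n_samples, which should matter when
predicting over tens of millions of rows.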

--
Olivier

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 

