One other thing to consider: you may not need all of the millions of
examples to explore the decision space (tuning hyperparameters, choosing
kernels, etc.) while building the model. Try randomly subsampling the data
(maybe 10k-100k samples is enough, depending on your dataset) and
training/testing on that subset to get a feel for how each choice of
classifier, polynomial features, etc. affects performance. This has a dual
benefit: each experiment takes much less time, and by the time you commit
to the long run over all the data (.predict() with your final model, which
*could* be trained on a random subset or on the full data) you have some
evidence that it will be worthwhile.
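
As a rough sketch of that workflow (not a recommendation of specific
settings): the snippet below draws a random subset and cross-validates a
few candidate pipelines on it. The data is a synthetic stand-in from
make_classification, and the subset size, the candidate list, and every
hyperparameter value are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the real data (assumption): 5 features, 2 classes.
X, y = make_classification(n_samples=20_000, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

# Draw a random subset without replacement; the size is a guess to tune.
rng = np.random.default_rng(0)
subset = rng.choice(X.shape[0], size=2_000, replace=False)
X_sub, y_sub = X[subset], y[subset]

# A few candidate models to compare cheaply on the subset.
candidates = {
    "linear_svc": make_pipeline(
        StandardScaler(), LinearSVC(max_iter=5000, random_state=0)),
    "poly2_linear_svc": make_pipeline(
        PolynomialFeatures(degree=2), StandardScaler(),
        LinearSVC(max_iter=5000, random_state=0)),
    "nystroem_rbf_sgd": make_pipeline(
        StandardScaler(), Nystroem(n_components=100, random_state=0),
        SGDClassifier(random_state=0)),
    "linear_sgd": make_pipeline(
        StandardScaler(), SGDClassifier(random_state=0)),
}

# Mean 3-fold cross-validated accuracy for each candidate.
scores = {name: cross_val_score(model, X_sub, y_sub, cv=3).mean()
          for name, model in candidates.items()}
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
```

Whichever candidate looks best on the subset can then be refit on a larger
sample (or everything) before the long .predict() run.
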
Depending on your dataset, Naive Bayes may also be a decent choice to try.
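
For example, a minimal sketch with GaussianNB, assuming the 5 features are
continuous (that is my assumption, not something stated in the thread).
The data is again a synthetic stand-in, and the chunk count is arbitrary.
partial_fit lets you feed the data in pieces, which may help if the full
set does not fit comfortably in memory.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the real data (assumption): 5 features, 2 classes.
X, y = make_classification(n_samples=20_000, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

nb = GaussianNB()
classes = np.unique(y_train)  # partial_fit needs the full label set up front

# Feed the training data in 10 chunks instead of all at once.
for X_chunk, y_chunk in zip(np.array_split(X_train, 10),
                            np.array_split(y_train, 10)):
    nb.partial_fit(X_chunk, y_chunk, classes=classes)

accuracy = nb.score(X_test, y_test)
print(f"GaussianNB held-out accuracy: {accuracy:.3f}")
```
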
On Thu, Feb 20, 2014 at 7:36 AM, Mathieu Blondel <math...@mblondel.org> wrote:
> With millions of samples, LinearSVC or SGDClassifier are more appropriate.
> However, they only support the linear kernel. Since you have only 5
> features, I think it would be worth trying non-linear features. You can try
> the kernel approximation module [1] and PolynomialFeatures [2]
>
>
> [1] http://scikit-learn.org/dev/modules/kernel_approximation.html#kernel-approximation
>
> [2] https://github.com/scikit-learn/scikit-learn/blob/master/examples/linear_model/plot_polynomial_interpolation.py
> (only in master)
>
> HTH,
> Mathieu
>
>
> On Thu, Feb 20, 2014 at 9:20 PM, Tommy Carstensen <t...@sanger.ac.uk> wrote:
>
>> To scikit-learn-general,
>>
>> I am trying to do a binary classification (true/false) of millions of
>> samples across 5 features with SVM. How many samples should I use for
>> building my model? I tried using svm.SVC().fit() on hundreds of
>> thousands of samples, but it ran for more than 12 hours. I am quite new
>> to machine learning, so any help provided will be much appreciated.
>> Thank you.
>>
>> P.S. I am not sure, if this is the appropriate forum. Please ignore my
>> question, if it does not belong on this mailing list.
>>
>> Best wishes,
>> Tommy
>>
>>
>>
>>
>> --
>> The Wellcome Trust Sanger Institute is operated by Genome Research
>> Limited, a charity registered in England with number 1021457 and a
>> company registered in England with number 2742969, whose registered
>> office is 215 Euston Road, London, NW1 2BE.
>>
>>
>> ------------------------------------------------------------------------------
>> Managing the Performance of Cloud-Based Applications
>> Take advantage of what the Cloud has to offer - Avoid Common Pitfalls.
>> Read the Whitepaper.
>>
>> http://pubads.g.doubleclick.net/gampad/clk?id=121054471&iu=/4140/ostg.clktrk
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>
>
>