On 14 July 2012 04:22, Olivier Grisel <[email protected]> wrote:

> 2012/7/13 Abhi <[email protected]>:
> > Hello,
> >    My problem is to classify a set of 200k+ emails into approx. 2800
> >  categories. Currently the method I am using is calculating tfidf and
> >  using LinearSVC() [with a good accuracy of 98%] for classification.
> >  The training time is ~30-60 min [~16g of mem, and doubles every 75000
> >  mails]. I was wondering what would be the best way to introduce online
> >  learning in my current model? [And I am worried about how this
> >  solution would scale, especially since the number of categories is
> >  unbounded, or is definitely going to increase over time]. I do not
> >  have much experience with scikit, so have not explored all the paths,
> >  but if I am missing anything any help, suggestions would be
> >  appreciated.
>
> LinearSVC is based on liblinear, which only implements batch
> optimization. Instead you can use SGDClassifier, which features a
> partial_fit method that you can call several consecutive times on
> chunks of data for incremental learning.
>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
>

Does partial_fit() prefer large or small chunks? For example, could I use
it to train a classifier with O(10^4) samples, then continue training with
successive batches of, say, 10 or 100, or is it better to train with
similarly-sized batches throughout?
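
To make the scenario concrete, here is a minimal sketch of the pattern I
have in mind: one large initial chunk, then many small ones. The synthetic
data, the make_batch helper and all sizes are assumptions for illustration
only, and the sketch does not by itself show which batch size is
preferable. Note that the first call to partial_fit needs the full list of
classes.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)

# Hypothetical stand-in data: 3 well-separated classes, 20 features.
n_classes, n_features = 3, 20
centers = rng.randn(n_classes, n_features) * 3.0


def make_batch(n):
    """Draw n labelled samples around the class centers (illustrative only)."""
    y = rng.randint(0, n_classes, size=n)
    X = centers[y] + rng.randn(n, n_features)
    return X, y


clf = SGDClassifier(random_state=0)
all_classes = np.arange(n_classes)

# Large initial chunk, O(10^4) samples. The first partial_fit call
# must be given every class label that can ever appear.
X0, y0 = make_batch(10000)
clf.partial_fit(X0, y0, classes=all_classes)

# Continue training with successive small batches of 100 samples.
for _ in range(50):
    Xb, yb = make_batch(100)
    clf.partial_fit(Xb, yb)

# Held-out check on fresh samples from the same synthetic distribution.
X_test, y_test = make_batch(1000)
accuracy = clf.score(X_test, y_test)
```

Comparing `accuracy` across runs with different batch sizes in the loop
would be one way to probe the question empirically.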

Thanks,
Fred.
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general