Andreas Müller <amueller@...> writes:

> 
> Hi Fred.
> As each sample is used individually and the weights are updated after
> each sample, it doesn't matter.
> If you pass very small "batches", the overhead of calling the fitting
> function is probably bigger, though.
> Cheers,
> Andy
> 
> ----- Original Message -----
> From: "Fred Mailhot" <fred.mailhot <at> gmail.com>
> To: scikit-learn-general <at> lists.sourceforge.net
> Sent: Saturday, 14 July 2012 22:14:51
> Subject: Re: [Scikit-learn-general] Online learning
> 
> On 14 July 2012 04:22, Olivier Grisel <olivier.grisel <at> ensta.org> wrote:
> 
> 2012/7/13 Abhi <kolhe_abhi <at> yahoo.co.in>:
> > Hello,
> > My problem is to classify a set of 200k+ emails into approx. 2800
> > categories. Currently the method I am using is calculating tf-idf and
> > using LinearSVC() [with a good accuracy of 98%] for classification.
> > The training time is ~30-60 min [~16 GB of memory, and it doubles
> > every 75000 mails]. I was wondering what would be the best way to
> > introduce online learning into my current model? [And I am worried
> > about how this solution would scale, especially since the number of
> > categories is unbounded, or is definitely going to increase over
> > time.] I do not have much experience with scikit-learn, so I have not
> > explored all the paths, but if I am missing anything, any help or
> > suggestions would be appreciated.
> 
> LinearSVC is based on liblinear, which only implements batch
> optimization. Instead you can use SGDClassifier, which features a
> partial_fit method that you can call several consecutive times on
> chunks of data for incremental learning.
> 


Hello,
    Sorry for getting back late. I originally experimented with different
classifiers, including SGDClassifier; it seemed faster but much less accurate,
about 93% for 30000 emails [and decreasing as the number of emails increases],
but I have not tried the incremental approach yet. I will try it next.
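
If I understand the suggestion correctly, the incremental loop would look
roughly like this (untested sketch; the chunk size and the SGDClassifier
parameters are placeholders, and X_train/y_train stand in for my tf-idf
matrix and labels):

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    clf = SGDClassifier(loss="hinge")

    # partial_fit needs the complete set of classes on the first call,
    # since a given chunk may not contain every category.
    classes = np.unique(y_train)

    chunk_size = 10000
    for start in range(0, X_train.shape[0], chunk_size):
        end = start + chunk_size
        clf.partial_fit(X_train[start:end], y_train[start:end],
                        classes=classes)
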
   During this time I have been running into segfaults in LinearSVC. If I use
bigrams in the vectorizer, the memory usage more than doubles, and I get a
segfault in classifier.fit. I tried reducing the number of features, so as to
reduce the size, using SelectKBest (as shown in
http://scikit-learn.org/stable/auto_examples/document_classification_20newsgroups.html).

From the test run:
I use TfidfVectorizer to extract features from the training and test datasets.

[Train data] n_samples: 47237, n_features: 3118889
[Test data]  n_samples: 23974, n_features: 3118889
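
For reference, the vectorization step looks roughly like this (simplified;
the parameter values are the ones the 20newsgroups example uses, not
necessarily my exact settings, and train_docs/test_docs stand for the raw
email texts):

    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                 stop_words='english')
    data_train = vectorizer.fit_transform(train_docs)  # learns the vocabulary
    data_test = vectorizer.transform(test_docs)        # reuses that vocabulary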

After that I select the k best features, but I get a segfault at:

        ch2 = SelectKBest(chi2, k=500)
-->     data_train = ch2.fit_transform(data_train, self.train_target)
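
For context, the whole selection step is roughly the following (simplified
from my code, following the 20newsgroups example; note that the test matrix
only gets transform(), so it is reduced to the same 500 features):

    from sklearn.feature_selection import SelectKBest, chi2

    # chi2 scores each feature against the labels; tf-idf values are
    # non-negative, which chi2 requires.
    ch2 = SelectKBest(chi2, k=500)
    data_train = ch2.fit_transform(data_train, train_target)
    data_test = ch2.transform(data_test)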
 
  I used this method because I read somewhere [I came across the reference
but forgot to mark the link] that the chi-squared test works well for
selecting the most informative of a large set of sparse features. Does that,
and my approach in general, seem correct?
Thank you for the responses and your valuable input,
Abhi





