Olivier Grisel <olivier.grisel@...> writes:

> 
> 2012/7/25 Abhi <kolhe_abhi@...>:
> >
> > Hello,
> >     Sorry for getting back late. I originally had experimented with
> > different classifiers including SGDClassifier; it seemed faster but
> > much less accurate, about 93% for 30000 emails (and decreasing as the
> > number of emails increases), but I have not tried the incremental
> > approach. Will try it next.
> 
> The incremental approach should not change the outcome. It will just
> make it possible not to load all of the data ahead of time. Have a
> look at the sklearn/linear_model/sgd.py source code to know exactly
> what the difference is.
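> 
> A minimal sketch of such an out-of-core loop (get_minibatch is a
> hypothetical helper yielding already-vectorized chunks; the full class
> list has to be known up front):
> 
>     from sklearn.linear_model import SGDClassifier
> 
>     clf = SGDClassifier(loss='hinge', alpha=1e-4)
>     all_classes = [0, 1]  # e.g. ham vs. spam; must cover every label
>     for X_chunk, y_chunk in get_minibatch():  # hypothetical chunk iterator
>         # partial_fit updates the model one chunk at a time, so the
>         # full dataset never has to be in memory at once
>         clf.partial_fit(X_chunk, y_chunk, classes=all_classes)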
> 
> For the accuracy issue I suspect that you used the default n_iter
> value? Maybe you can try to increase or reduce that value. Also you
> can try alternative parameters for the learning rate (see the doc for
> more details).
> 
> http://scikit-learn.org/stable/modules/sgd.html
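> 
> A rough sketch of the kind of variations meant here (the parameter
> values are only illustrative):
> 
>     from sklearn.linear_model import SGDClassifier
> 
>     # more passes over the data than the default n_iter=5
>     clf = SGDClassifier(n_iter=50)
> 
>     # or a constant learning rate instead of the default schedule
>     clf = SGDClassifier(learning_rate='constant', eta0=0.01)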
> 
> >    During this time I have been facing segfaults in LinearSVC. If I use
> > bigrams in the vectorizer, the memory usage more than doubles, and I
> > get a segfault in classifier.fit. I tried reducing the number of
> > features with SelectKBest (as shown in
> > http://scikit-learn.org/stable/auto_examples/document_classification_20newsgroups.html)
> >
> > From the test run:
> > I use TfidfVectorizer to extract features from test and training dataset.
> >
> > [Train data] n_samples: 47237, n_features: 3118889
> > [Test data] n_samples: 23974, n_features: 3118889
> >
> > After which I am selecting the k-best features, but get a segfault at
> >         ch2 = SelectKBest(chi2, k=500)
> > -->     data_train = ch2.fit_transform(data_train, self.train_target)
> >
> >   I used this method since I read somewhere [I came across the
> > reference but forgot to mark the link] that the chi-squared test would
> > be best for selecting the most informative of a set of sparse
> > features. Does that or my approach seem correct?
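> > For reference, a sketch of the selection pattern (data_train,
> > data_test and train_target as above):
> >
> >     from sklearn.feature_selection import SelectKBest, chi2
> >
> >     ch2 = SelectKBest(chi2, k=500)
> >     # fit on the training set only, then apply the same column
> >     # selection to the test set
> >     data_train = ch2.fit_transform(data_train, train_target)
> >     data_test = ch2.transform(data_test)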
> > Thank you for the responses and your valuable input,
> 
> Could you please try to come up with one or two minimalistic
> reproduction scripts for the ch2.fit_transform and LinearSVC.fit
> segfaults? Is it just that it is exhausting memory on your system? Are
> you running a 32bit or a 64bit OS? How much physical memory do you
> have on your machine? You can push the scripts on
> https://gist.github.com (note those are git repositories so that you
> can push data files using git there too).
> 
> As a temporary workaround you can pass a max_features parameter to the
> vectorizer so as to limit the feature mapping to the most frequent
> features over the training corpus. For instance you can try values such
> as max_features=100000 or max_features=1000000.
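> 
> For example (a minimal sketch; train_texts and test_texts stand in for
> the raw documents):
> 
>     from sklearn.feature_extraction.text import TfidfVectorizer
> 
>     # keep only the 100000 most frequent terms seen in the training
>     # corpus, which bounds the width of the sparse feature matrix
>     vectorizer = TfidfVectorizer(max_features=100000)
>     X_train = vectorizer.fit_transform(train_texts)
>     X_test = vectorizer.transform(test_texts)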
> 

     For LinearSVC with unigrams it works fine, but if I use bigrams
(max_n=2 in TfidfVectorizer) I get the segfault. I am on 64-bit CentOS,
Python 2.6. The memory usage for LinearSVC (at LinearSVC.fit) was ~9G
with unigrams, and ~16G with bigrams for about 15-20 min, after which it
suddenly spikes and I get a segfault.
For SGD I had originally tried the default value and n_iter=50. Trying
out different combinations, I get the best accuracy, ~98%, with
SGDClassifier(n_iter=10, loss='modified_huber')
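A sketch of that best configuration, for reference (X_train etc. are
the tf-idf matrices from the vectorizer above):

    from sklearn.linear_model import SGDClassifier

    # modified_huber is a smoothed hinge loss, more tolerant of outliers
    clf = SGDClassifier(n_iter=10, loss='modified_huber')
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))  # ~0.98 here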
Adding max_features=500000 worked well. As for the scripts, I will post
them soon.
Thanks.
