2012/7/25 Abhi <[email protected]>:
>
> Hello,
>     Sorry for getting back late. I had originally experimented with
> different classifiers, including SGDClassifier; it seemed faster but much
> less accurate, about 93% for 30000 emails (and decreasing as the number of
> emails increases), but I have not tried the incremental approach. Will try
> it next.

The incremental approach should not change the outcome; it just makes
it possible to avoid loading all of the data ahead of time. Have a
look at the sklearn/linear_model/sgd.py source code to see exactly
what the difference is.
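
Something like this minimal sketch of the partial_fit pattern (toy
sparse mini-batches stand in for your vectorized emails; the shapes
and parameters are just placeholders):

import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(42)
clf = SGDClassifier(loss="hinge", alpha=1e-5)
all_classes = np.array([0, 1])  # partial_fit needs the full class list up front

# Simulate streaming: fit on 10 sparse mini-batches instead of one big matrix.
for _ in range(10):
    X_batch = sparse_random(1000, 5000, density=0.001,
                            format="csr", random_state=rng)
    y_batch = rng.randint(0, 2, size=1000)
    clf.partial_fit(X_batch, y_batch, classes=all_classes)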

For the accuracy issue, I suspect you used the default n_iter value?
Maybe you can try to increase or decrease that value. You can also
try alternative parameters for the learning rate (see the doc for
more details).

http://scikit-learn.org/stable/modules/sgd.html
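
A rough sketch of the kind of sweep I mean, on toy data (random, so
the scores themselves are meaningless; substitute your vectorized
emails). Note that n_iter is the parameter name in the scikit-learn
releases this thread refers to; recent versions call it max_iter:

import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.linear_model import SGDClassifier

# Toy sparse data standing in for the vectorized emails.
rng = np.random.RandomState(0)
X = sparse_random(2000, 1000, density=0.01, format="csr", random_state=rng)
y = rng.randint(0, 2, size=2000)
X_train, X_test = X[:1500], X[1500:]
y_train, y_test = y[:1500], y[1500:]

best_params, best_score = None, -1.0
for n_iter in (5, 20, 50):
    for lr in ("optimal", "constant", "invscaling"):
        clf = SGDClassifier(loss="hinge", n_iter=n_iter,
                            learning_rate=lr, eta0=0.1)
        score = clf.fit(X_train, y_train).score(X_test, y_test)
        if score > best_score:
            best_params, best_score = (n_iter, lr), score
print(best_params, best_score)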

>    During this time I have been facing segfaults in LinearSVC. If I use
> bigrams in the vectorizer, the memory usage more than doubles, and I get a
> segfault in classifier.fit. I tried reducing the number of features, so as
> to reduce the size, using SelectKBest (as shown in
> http://scikit-learn.org/stable/auto_examples/document_classification_20newsgroups.html)
>
> From the test run:
> I use TfidfVectorizer to extract features from the test and training datasets.
>
> [Train data] n_samples: 47237, n_features: 3118889
> [Test data] n_samples: 23974, n_features: 3118889
>
> After that, I select the k best features, but I get a segfault at
>         ch2 = SelectKBest(chi2, k=500)
> -->     data_train = ch2.fit_transform(data_train, self.train_target)
>
>   I used this method since I read somewhere [I came across the reference,
> but forgot to mark the link] that a chi-squared test would be best for
> selecting the best of the sparse features. Does that, or my approach, seem
> correct?
> Thank you for the responses and your valuable input,

Could you please try to come up with one or two minimal
reproduction scripts for the ch2.fit_transform and LinearSVC.fit
segfaults? Is it just that they are exhausting the memory on your
system? Are you running a 32-bit or a 64-bit OS? How much physical
memory does your machine have? You can push the scripts to
https://gist.github.com (note that gists are git repositories, so you
can push data files there using git too).
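
For instance, something along these lines, using random sparse data
with the shape you reported (the density here is a guess; tune it to
match the real matrix):

import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.feature_selection import SelectKBest, chi2

# Random sparse stand-in with the reported training shape; chi2 needs
# non-negative values, which scipy's uniform sparse generator provides.
rng = np.random.RandomState(0)
X = sparse_random(47237, 3118889, density=1e-5, format="csr", random_state=rng)
y = rng.randint(0, 2, size=47237)

ch2 = SelectKBest(chi2, k=500)
X_reduced = ch2.fit_transform(X, y)  # the call that reportedly segfaults
print(X_reduced.shape)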

As a temporary workaround you can pass a max_features parameter to the
vectorizer so as to limit the feature mapping to the most frequent
features over the training corpus. For instance you can try values in
the range of max_features=100000 to max_features=1000000.
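
For example (toy documents and a tiny cap just to show the parameter;
use values like the above on the real corpus):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["free money now", "meeting at noon tomorrow",
        "cheap pills free shipping now"]

# Cap the vocabulary at the most frequent terms over the training corpus.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=10)
X = vectorizer.fit_transform(docs)
print(X.shape)  # at most (3, 10)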

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
