2013/10/18 Lars Buitinck <larsm...@gmail.com>:
> 2013/10/18 Nigel Legg <nigel.l...@gmail.com>:
>> What am I doing wrong here?
>
> Could be lots of things. In any case, using an untuned SVC for this
> task is a bad idea because (a) you need to tune it and (b) it's an
> SVC. Better try LinearSVC or SGDClassifier.

Indeed, SVC uses an RBF kernel by default, which is not well suited
for text classification. A linear model is often much better (and much
faster to train) on sparse, very high-dimensional data such as text.
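
As a rough illustration of the linear-model suggestion, here is a minimal
sketch that feeds TF-IDF features into LinearSVC. The six-document corpus
and its labels are invented for the example, C=1.0 is just the default
rather than a tuned value, and the code assumes a reasonably recent
scikit-learn:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus, invented purely for illustration.
docs = [
    "the team won the football match",
    "a late goal decided the match",
    "the striker scored a goal",
    "the new processor has more cores",
    "the laptop ships with a faster processor",
    "more memory makes the laptop faster",
]
labels = ["sport", "sport", "sport", "tech", "tech", "tech"]

# TF-IDF yields a sparse, high-dimensional matrix; LinearSVC fits a
# linear decision function directly on it, with no kernel involved.
clf = make_pipeline(TfidfVectorizer(), LinearSVC(C=1.0))
clf.fit(docs, labels)

predictions = clf.predict([
    "the goal came late in the match",
    "a processor with more memory",
])
print(predictions)
```

On real data the vectorizer settings and C would still need tuning.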

Also, you should never expect classifiers to work correctly with
their default parameter values. You have to grid search (manually or
automatically with GridSearchCV) over the most important parameters,
typically the regularization strength for linear models such as
LinearSVC (the C parameter) and SGDClassifier (the alpha parameter).
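
A minimal sketch of such a search, assuming a recent scikit-learn where
GridSearchCV lives in sklearn.model_selection. The toy corpus and the
candidate values for C and alpha are placeholders picked for the example,
not recommended ranges:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus, invented purely for illustration.
docs = [
    "the team won the football match",
    "a late goal decided the match",
    "the striker scored a goal",
    "the new processor has more cores",
    "the laptop ships with a faster processor",
    "more memory makes the laptop faster",
]
labels = ["sport", "sport", "sport", "tech", "tech", "tech"]

# Illustrative grids only; real ranges depend on the data set.
c_grid = [0.01, 0.1, 1, 10, 100]
alpha_grid = [1e-5, 1e-4, 1e-3, 1e-2]

# make_pipeline names each step after its lowercased class name, hence
# the "linearsvc__C" and "sgdclassifier__alpha" parameter keys.
candidates = {
    "LinearSVC": (
        make_pipeline(TfidfVectorizer(), LinearSVC()),
        {"linearsvc__C": c_grid},
    ),
    "SGDClassifier": (
        make_pipeline(TfidfVectorizer(), SGDClassifier(random_state=0)),
        {"sgdclassifier__alpha": alpha_grid},
    ),
}

searches = {}
for name, (pipeline, grid) in candidates.items():
    # cv=3 only because the toy corpus has three documents per class.
    search = GridSearchCV(pipeline, grid, cv=3)
    search.fit(docs, labels)
    searches[name] = search
    print(name, search.best_params_, search.best_score_)
```

Putting the vectorizer inside the pipeline means it is refit on each
training fold, so the held-out fold does not leak into the features.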

Have a look at the document classification example for models and
ranges of parameter values that work for text classification:

http://scikit-learn.org/stable/auto_examples/document_classification_20newsgroups.html

-- 
Olivier

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
